MLFF training stuck after first ionic step

Posted: Fri Oct 21, 2022 3:26 pm
by reach2sayan
Hi,

I am trying to fit an MLFF using VASP. I wanted to check that my installation works, so I tried an example. I have attached the INCAR, OUTCAR, POSCAR, ML_LOGFILE and the stdout here (also the ICONST, since I want to sample a liquid phase at high T).

If I remove the ML tags and run just an MD, everything is fine. In fact, I settled on the MD hyperparameters (LANGEVIN_GAMMA etc.) by doing just that. However, when I start the ML training, it gets stuck after the first set of electronic steps converges (as you can see from the output files). It stayed like that for about 6 hours before I canceled it.

I wonder what I'm doing wrong. Most probably it could be the installation itself? Thank you for the kind help.

Best
Sayan

Re: MLFF training stuck after first ionic step

Posted: Fri Oct 21, 2022 3:40 pm
by reach2sayan
Sorry, I see I should also post the KPOINTS and the job script. I compiled VASP on SDSC PSC Bridges https://www.psc.edu/resources/bridges-2/

KPOINTS
Si
0 0 0
Gamma
4 4 4
0 0 0

jobscript
#!/bin/bash

#SBATCH -t 48:00:00
#SBATCH -p RM
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=120

ulimit -s unlimited
module load intel intelmpi cuda hdf5 # same ones with which it was compiled
export OMP_NUM_THREADS=1

mpirun vasp.6.3.2/bin/vasp_std > vasp.out

Re: MLFF training stuck after first ionic step

Posted: Thu Oct 27, 2022 9:40 am
by ferenc_karsai
I just ran your calculation and it ran without any problem. I also tried it with 8 and 128 cores and it ran fine.
So it is most likely a problem of your installation.

Try the following:
-) Compile without scaLAPACK (remove -DscaLAPACK from your CPP_OPTIONS in the makefile.include).
-) Compile without shared memory (remove -Duse_shmem in CPP_OPTIONS).

You used 240 cores in your calculation.
Don't use so many; it's enough to try it with 8 cores.

I also saw that you have TEBEG=1800 and TEEND=800 in your calculation. Never run cooling runs in on-the-fly machine learning. Always use heating runs. Otherwise the automatic threshold determination can get stuck. This is also explained on our best practices wiki page:
wiki/index.php/Best_practices_for_machi ... rce_fields
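
A quick way to catch an accidental cooling run before submitting is to compare TEBEG and TEEND in the INCAR. The following is only a minimal sketch: the functions incar_value and check_heating are hypothetical helpers (not part of VASP), and it assumes one "TAG = value" pair per line, with both TEBEG and TEEND set explicitly.

```shell
#!/bin/bash
# Hypothetical helpers (not part of VASP): flag a cooling run in an INCAR.
# Assumption: one "TAG = value" pair per line; ';'-separated tags are not handled.

# Print the value of a tag, ignoring case and trailing '#' or '!' comments.
incar_value() {
    # $1 = tag name, $2 = path to INCAR
    awk -v tag="$1" '
        {
            line = $0
            sub(/[#!].*/, "", line)              # strip trailing comments
            n = split(line, kv, "=")
            if (n < 2) next                      # not a TAG = value line
            key = kv[1]; gsub(/[ \t]/, "", key)
            if (toupper(key) == toupper(tag)) {
                val = kv[2]; gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val; exit
            }
        }' "$2"
}

# Print "ok" if TEBEG <= TEEND, "cooling" if TEBEG > TEEND,
# and "unknown" if either tag is missing.
check_heating() {
    # $1 = path to INCAR
    local tebeg teend
    tebeg=$(incar_value TEBEG "$1")
    teend=$(incar_value TEEND "$1")
    if [ -z "$tebeg" ] || [ -z "$teend" ]; then
        echo "unknown"; return 2
    fi
    # Compare numerically in awk so that values like 800.0 also work.
    if awk -v b="$tebeg" -v e="$teend" 'BEGIN { exit !(b + 0 <= e + 0) }'; then
        echo "ok"
    else
        echo "cooling"; return 1
    fi
}
```

Calling check_heating INCAR in the run directory would print "cooling" for the settings in this thread (TEBEG=1800, TEEND=800), so the job script could refuse to start on-the-fly training in that case.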

Re: MLFF training stuck after first ionic step

Posted: Mon Nov 14, 2022 9:16 pm
by reach2sayan
Hi,

Possibly the problem was either the libbeef installation or that I was running out of stack size. In any case, it is fixed now. I also fixed the other issue you pointed out (that run was just a check to see if ML training worked).

Best
Sayan

Re: MLFF training stuck after first ionic step

Posted: Tue Nov 15, 2022 8:21 am
by ferenc_karsai
Good to hear everything works now. Thanks for sharing your solution.