MLFF Hangs Before First Ionic Step when ML_AB is Present
Posted: Wed Aug 23, 2023 12:05 am
Hi everyone,
I very much enjoy the new MLFF feature; however, I have encountered a few issues fairly consistently while attempting to run with pre-existing ML_AB files...
When attempting to restart a run with a somewhat large ML_AB (in this particular case it contains 1475 structures, 6 elements, max 3124 basis configs), VASP will hang after the SCF cycle has converged. After this first step is completed, the line reporting the temperature/energy/etc. is not written to the OSZICAR & stdout, and none of the OUTCAR, stdout, vasprun.xml, ML_ABN, or ML_REG are updated after the positions and total forces have been written to the OUTCAR/vasprun.
The ML_LOGFILE, after all of the initial boilerplate, shows this:
Code:
STATUS 0 learning 3 T F 0 0
SPRSC 0 1475 1475 Al 550 550 C 1893 1893 H 3124 3124 O 2952 2952 Zn 1237 1237 P 170 170
REGR 0 1 1 4.80520352E-01 2.70686887E-02 3.88934899E-13 6.60186318E+03
REGR 0 1 2 2.65297417E+00 2.06119355E-02 5.36422807E-14 5.00693884E+03
REGR 0 1 3 9.60078933E+00 1.83151329E-02 1.31711779E-14 4.43880639E+03
REGR 0 1 4 2.31145840E+01 1.73880995E-02 5.19382743E-15 4.20853094E+03
REGR 0 1 5 3.98908662E+01 1.70119314E-02 2.94443274E-15 4.11453918E+03
REGR 0 1 6 5.50733221E+01 1.68483116E-02 2.11220757E-15 4.07342537E+03
REGR 0 1 7 6.64584665E+01 1.67694861E-02 1.74217156E-15 4.05352463E+03
REGR 0 1 8 7.41236092E+01 1.67285109E-02 1.55819644E-15 4.04314007E+03
REGR 0 1 9 7.89711368E+01 1.67062112E-02 1.46059921E-15 4.03747233E+03
This has happened multiple times upon cancelling and restarting this job, while the same INCAR & POSCAR without the ML_AB file run without a problem. I left one instance running for 2 days before checking on it and seeing this, so I'm fairly certain nothing more was going to happen.
This is with a fresh compilation of VASP 6.4.2 using gcc 11, OpenMPI 4.1.5, OpenBLAS, and the AMD-optimized ScaLAPACK & FFTW. The precompiler flags use_shmem, shmem_bcast_buffer, shmem_rproj, and sysv are all used. Slurm reports that 286 GB of memory was used, well within the 500 GB provided to it.
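In case it helps with cross-checking, here is a minimal Python sketch that splits the SPRSC line from the ML_LOGFILE excerpt into per-element counts, so it can be compared against the 1475 structures, 6 elements, and maximum of 3124 basis configs mentioned above. The function name parse_sprsc is hypothetical, and the column layout (a step index, two structure counts, then an element symbol followed by two counts for each element) is an assumption read off the shape of the line, not taken from the VASP documentation.

```python
def parse_sprsc(line):
    """Split one SPRSC line into (step, structure counts, per-element counts).

    Assumed layout (inferred from the log excerpt, not from VASP docs):
    SPRSC <step> <n1> <n2> then repeated <element> <count1> <count2>.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "SPRSC":
        raise ValueError("not an SPRSC line")
    step, n_first, n_second = (int(t) for t in tokens[1:4])
    rest = tokens[4:]
    if len(rest) % 3 != 0:
        raise ValueError("unexpected number of per-element fields")
    per_element = {}
    for i in range(0, len(rest), 3):
        # Each element contributes a symbol followed by two integer counts.
        per_element[rest[i]] = (int(rest[i + 1]), int(rest[i + 2]))
    return step, (n_first, n_second), per_element


line = ("SPRSC 0 1475 1475 Al 550 550 C 1893 1893 H 3124 3124 "
        "O 2952 2952 Zn 1237 1237 P 170 170")
step, structures, per_element = parse_sprsc(line)

# Element with the largest second count, which here matches the quoted maximum.
largest = max(per_element, key=lambda el: per_element[el][1])
print(len(per_element), structures[1], largest, per_element[largest][1])
# prints: 6 1475 H 3124
```

The numbers agree with what the ML_AB header reports, so the file appears to be read in completely before the hang.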
Any help with diagnosing this issue would be greatly appreciated!
Thank you very much,
Vivienne