Continuation of Machine Learning Jobs
Posted: Tue May 10, 2022 4:07 pm
Dear VASP-Team,
I am doing ML-FF generation calculations on different systems, some of them are rather large and require quite long times for single AIMD steps.
Since most of our available computing clusters have walltime restrictions of one day, I am often only able to calculate the first, e.g., 500 steps of the ML process (with ML_ISTART = 0).
If I then do the straightforward thing and start a second calculation from the ML_AB file of the first calculation with ML_ISTART = 1, the learning process essentially needs as many AIMD steps as before: the first 10-12 steps are always calculated with AIMD, and every 3rd to 5th step is calculated with AIMD for the following hundreds of steps. I am quite puzzled by this, since the same dynamics run in a single calculation would of course converge progressively, so that AIMD steps are required much less often after the first few hundred or thousand MD steps.
This behavior severely limits my ability to generate machine-learning force fields for larger systems: each 1-day calculation only covers around 500 MD steps, which is a much too high ratio of AIMD to ML-FF steps and (presumably) yields a quite bloated training set covering only a small portion of configuration space, even if I repeat the process 10 times or so. Real convergence seems to be almost unreachable.
Is it possible to modify some input settings so that a calculation started with ML_ISTART = 1 behaves as a true direct continuation of the previous ML-FF generation run, with, for example, only every 10th to 50th MD step being calculated with DFT right from the start, depending on the progress made in the previous calculation(s)?
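For concreteness, my continuation runs use an INCAR fragment along the following lines. This is a minimal sketch: the ML_CTIFOR line is an assumption on my part (a manually raised Bayesian error threshold as a possible way to suppress the renewed burst of AIMD steps), and its value is purely illustrative, not a tested recipe.

```
# Continuation of on-the-fly ML-FF training (sketch, not a verified recipe)
ML_LMLFF  = .TRUE.     # enable machine-learned force fields
ML_ISTART = 1          # continue training, reading the existing ML_AB file

# Assumed/illustrative: manually setting the error threshold instead of
# letting it be re-initialized at restart -- is something like this intended?
ML_CTIFOR = 0.02
```

If there is an officially supported combination of tags for this purpose, a pointer to it would already help a lot.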
Thank you in advance,
Julien