how to manually generate the ML_AB file
-
- Newbie
- Posts: 5
- Joined: Tue Nov 12, 2019 1:23 am
how to manually generate the ML_AB file
Dear all, is it possible to generate an ML_AB file from already completed conventional MD calculations, so that the MLFF can learn from them?
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: how to manually generate the ML_AB file
Yes, it is possible, but this feature is not yet thoroughly tested or fully supported.
If you still want to do it, follow these steps:
-) First, prepare an ML_AB file that looks like this:
wiki/index.php/ML_AB
Please note that the order in which the element types appear on line 13 ("The atom types in the data file") has to be the same as the order in which they later appear in the configurations. For example, if the first system is Fe_xNi_y and the second is Co_xSi_y, this line would have to be written as
**************************************************
The atom types in the data file
--------------------------------------------------
Fe Ni Co Si
Please also note that the reference atomic energies and atomic masses follow this same order. In your case you can set the reference atomic energies to 0.
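A minimal Python sketch of this ordering rule, assuming each configuration is available as a list of element symbols in atom order (extracting these from your MD output is up to you; the function names are hypothetical and not part of VASP). It collects element types in order of first appearance and arranges per-type values such as masses or reference energies in that same order.

```python
def global_type_order(configurations: list[list[str]]) -> list[str]:
    """Element types in order of first appearance across all configurations."""
    order: list[str] = []
    for symbols in configurations:
        for symbol in symbols:
            if symbol not in order:
                order.append(symbol)
    return order


def values_in_type_order(order: list[str], values: dict[str, float]) -> list[float]:
    """Arrange per-element values (masses, reference energies) in the global type order."""
    return [values[symbol] for symbol in order]


# Example: an Fe_xNi_y configuration followed by a Co_xSi_y configuration
configs = [["Fe", "Fe", "Ni"], ["Co", "Si", "Si"]]
order = global_type_order(configs)
print(" ".join(order))  # Fe Ni Co Si

# Illustrative standard atomic masses; use the masses of your own setup in practice
masses = {"Si": 28.085, "Co": 58.933, "Fe": 55.845, "Ni": 58.693}
print(values_in_type_order(order, masses))  # [55.845, 58.693, 58.933, 28.085]

# Reference atomic energies set to 0, as suggested in the post
print(values_in_type_order(order, dict.fromkeys(order, 0.0)))  # [0.0, 0.0, 0.0, 0.0]
```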
The basis sets (local reference configurations) for the elements can be set to 1. This part is ignored, but dummy values have to be present so that the reader works correctly.
So for example you would write:
**************************************************
The numbers of basis sets per atom type
--------------------------------------------------
1 1 1 1
**************************************************
Basis set for Fe
--------------------------------------------------
1 1
**************************************************
Basis set for Ni
--------------------------------------------------
1 1
**************************************************
Basis set for Co
--------------------------------------------------
1 1
**************************************************
Basis set for Si
--------------------------------------------------
1 1
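The dummy basis-set block can be generated for any list of types. A small Python sketch, assuming separator lines of 50 asterisks and 50 dashes to match the example above (check the widths against the ML_AB wiki page); dummy_basis_section is a hypothetical helper.

```python
STARS = "*" * 50   # assumed separator width
DASHES = "-" * 50  # assumed separator width


def dummy_basis_section(types: list[str]) -> str:
    """Build the ignored-but-required basis-set block with one dummy entry per type."""
    lines = [
        STARS,
        "The numbers of basis sets per atom type",
        DASHES,
        " ".join("1" for _ in types),
    ]
    for element in types:
        lines += [STARS, f"Basis set for {element}", DASHES, "1 1"]
    return "\n".join(lines)


print(dummy_basis_section(["Fe", "Ni", "Co", "Si"]))
```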
Also very important: training structures have to be properly grouped and given unique names. Training structures that contain the same element types and the same number of atoms per element belong to the same group.
This strict ordering of structures and elements will be lifted in the next update, so that users won't have to be so strict about naming and ordering. Nevertheless, it's good practice to order the data correctly.
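Grouping by composition can be sketched in Python as follows, assuming each structure is a list of element symbols and the global type order is already known. The naming scheme (element symbol followed by its count, e.g. Fe1Ni1) only illustrates unique group names and is not a VASP convention.

```python
from collections import Counter


def composition_name(symbols: list[str], type_order: list[str]) -> str:
    """Unique name from element types and atom counts, in the global type order."""
    counts = Counter(symbols)
    unknown = set(counts) - set(type_order)
    if unknown:
        raise ValueError(f"types missing from the global order: {sorted(unknown)}")
    return "".join(f"{t}{counts[t]}" for t in type_order if counts[t] > 0)


def group_structures(structures: list[list[str]], type_order: list[str]) -> dict[str, list[int]]:
    """Map each composition name to the indices of the structures that share it."""
    groups: dict[str, list[int]] = {}
    for index, symbols in enumerate(structures):
        groups.setdefault(composition_name(symbols, type_order), []).append(index)
    return groups


structures = [["Fe", "Ni"], ["Co", "Si"], ["Ni", "Fe"]]
groups = group_structures(structures, ["Fe", "Ni", "Co", "Si"])
print(groups)  # {'Fe1Ni1': [0, 2], 'Co1Si1': [1]}

# Write-out order that keeps every group contiguous
write_order = [i for indices in groups.values() for i in indices]
print(write_order)  # [0, 2, 1]
```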
-) Second, run a calculation using ML_ISTART=3:
wiki/index.php/ML_ISTART
This calculation loops over all existing training structures, reads them in one by one and emulates an on-the-fly training run. The sole purpose of this is to select the local reference configurations that make up the force field. Beware: this step can be quite time-consuming.
This step produces a new ML_AB file (ML_ABN) as well as an ML_FFN file that can be used.
-) Optionally, you may want to refine your force field using the new ML_AB file. For that, please have a look at this page:
wiki/index.php/Machine_learning_force_f ... rce_fields
-
- Newbie
- Posts: 5
- Joined: Tue Nov 12, 2019 1:23 am
Re: how to manually generate the ML_AB file
Dear Prof. Ferenc Karsai,
Thank you so much for your kind reply! Following your instructions, I manually built the ML_AB file. But in the second step (running a calculation with ML_ISTART=3), it seems to run into a memory problem:
LDA part: xc-table for Pade appr. of Perdew
Machine learning selected
Setting communicators for machine learning
Initializing machine learning
Starting to select new local configurations from ML_AB file (ML_FF_ISTART=3):
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 140995 RUNNING AT node41
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
...
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 47 PID 141041 RUNNING AT node41
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
forrtl: severe (41): insufficient virtual memory
Image PC Routine Line Source
vasp_std 0000000001D586FB Unknown Unknown Unknown
vasp_std 00000000005F3971 Unknown Unknown Unknown
vasp_std 00000000005EB217 Unknown Unknown Unknown
vasp_std 00000000005E81FA Unknown Unknown Unknown
vasp_std 00000000006B80D6 Unknown Unknown Unknown
vasp_std 00000000006B15C6 Unknown Unknown Unknown
vasp_std 00000000006BB8FE Unknown Unknown Unknown
vasp_std 00000000010529EA Unknown Unknown Unknown
vasp_std 0000000001CA65CE Unknown Unknown Unknown
vasp_std 000000000040DBA2 Unknown Unknown Unknown
libc-2.17.so 00002AAB57EF9555 __libc_start_main Unknown Unknown
vasp_std 000000000040DAA9 Unknown Unknown Unknown
Do you have any suggestions for this?
Best,
Pan
ferenc_karsai wrote: ↑Mon Feb 21, 2022 8:25 am Yes it is possible. But that feature is still not strongly tested and fully supported.
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: how to manually generate the ML_AB file
Training the force field is generally memory-consuming, especially if one has many training structures with many different element types.
Please provide some information about your job: How many training structures do you have, how many element types, and what is the maximum number of atoms per structure? You can also upload your ML_AB file so that I can check it.
Did you compile with MPI shared memory (-Duse_shmem precompiler option)?
The largest matrix that needs to be stored is the design matrix.
Its dimension is number_of_training_structures*(3*N_atom+7)*local_reference_configurations. At the start of an ML_ISTART=3 run you don't know the number of local reference configurations, so you have to set a maximum via ML_MB (I usually set it equal to the number of training structures, but this is system dependent; you may have to repeat the calculation afterwards). The maximum number of training structures (ML_MCONF) also has to be set, but here you can choose it exactly, since you know how many training structures your ML_AB file contains. The design matrix is then allocated statically at the beginning of the calculation. Before the actual allocations are done, the estimated memory is printed to the ML_LOGFILE, so you can see how much memory you need and how far you are from fitting into the available memory. The entry "FMAT for basis" is the memory required for the design matrix.
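As a rough cross-check of this formula, here is a Python sketch. It assumes one 8-byte double-precision value per matrix entry and reads the "+7" as one energy plus six stress components per structure; both are assumptions, and the "FMAT for basis" entry in the ML_LOGFILE remains the authoritative number. The maxima (ML_MCONF for structures, ML_MB for local reference configurations) are used because those are what gets allocated.

```python
def design_matrix_bytes(max_structures: int, max_atoms: int, max_local_refs: int,
                        bytes_per_entry: int = 8) -> int:
    """Size of a (structures*(3*N_atom+7)) x (local reference configurations) matrix."""
    rows = max_structures * (3 * max_atoms + 7)
    return rows * max_local_refs * bytes_per_entry


GIB = 1024 ** 3

# ML_MCONF = ML_MB = 12000 with 374 atoms per structure
print(design_matrix_bytes(12000, 374, 12000) / GIB)  # about 1211 GiB in total
```

Because the matrix is distributed over all cores, the per-core share is roughly this total divided by the number of MPI ranks.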
Please also read this wiki entry about the memory estimation in the ML_LOGFILE:
wiki/index.php/ML_LOGFILE#Memory_consumption_estimation
The design matrix is fully distributed in memory as well, so the more cores you use, the less memory it needs per core. By going to more nodes you can therefore possibly fit it into memory.
Another very important point is shared memory:
The covariance matrix and parts of the descriptors need to be present on every core in their full size ("CMAT for basis" and "DESC for basis" in the ML_LOGFILE). If MPI shared memory is not used, these matrices are allocated on every core; with shared memory they are allocated only once per node. Without shared memory one can therefore be strongly limited in memory, so please check whether you use this capability.
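The effect of shared memory on a single node can be sketched with a simple Python model. per_node_memory is a hypothetical helper; it assumes the design matrix is split evenly over all ranks and that "replicated" stands for the arrays every rank needs in full (CMAT and parts of DESC). Other per-rank arrays are ignored, so treat the result as a lower bound.

```python
def per_node_memory(fmat_total: float, replicated: float, ranks_per_node: int,
                    n_nodes: int, use_shmem: bool) -> float:
    """Approximate memory on one node for the design matrix plus replicated arrays."""
    total_ranks = ranks_per_node * n_nodes
    # The design matrix is distributed: each node holds the slices of its own ranks
    fmat_on_node = fmat_total * ranks_per_node / total_ranks
    # Without shared memory every rank keeps its own full copy
    copies = 1 if use_shmem else ranks_per_node
    return fmat_on_node + copies * replicated


# Hypothetical sizes in GiB: 800 GiB design matrix, 10 GiB replicated arrays
for shmem in (False, True):
    print(shmem, per_node_memory(800.0, 10.0, ranks_per_node=48, n_nodes=4, use_shmem=shmem))
# False 680.0
# True 210.0
```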
Please also see this wiki entry on memory usage and shared memory:
wiki/index.php/Machine_learning_force_f ... mory_usage
-
- Newbie
- Posts: 5
- Joined: Tue Nov 12, 2019 1:23 am
Re: how to manually generate the ML_AB file
Dear Prof. Ferenc Karsai,
Thank you so much for your quick reply!
I had already seen the shared-memory option for MPI before, but I did not use it because of potential risks on the cluster. Anyway, since you mention it, I will try it later. Here I would like to upload the ML_AB file first; it contains a total of 10,000 configurations with 374 atoms each.
BTW, I ran the calculation with "ML_MCONF = 12000; ML_MB = 12000". In the ML_LOGFILE, "Total memory consumption" is 131953.9 MB (ca. 129 GB). The compute node has a total of 376 GB of RAM, but the calculation still failed with a memory error, which I cannot understand.
Best,
Pan
PS: Since the full ML_AB file is too large (86.7 MB after compression) and exceeds the upload size limit, I kept only a few configurations in the uploaded ML_AB.
ferenc_karsai wrote: ↑Wed Feb 23, 2022 10:29 am The training of the force field is generally memory consuming.
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: how to manually generate the ML_AB file
OK, 10000 training structures is really huge for kernel methods. Setting ML_MCONF to 12000 is completely unnecessary; just set it to 10050. The same applies to ML_MB.
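Why the tighter limits matter: by the design-matrix formula quoted earlier, the memory grows with both ML_MCONF (rows) and ML_MB (columns), so the memory ratio is the product of the two ratios. A small Python sketch; suggested_limits is a hypothetical helper, and its 50-structure margin only mirrors the 10000 to 10050 example, not a VASP rule.

```python
def suggested_limits(n_structures: int, margin: int = 50) -> dict[str, int]:
    """Hypothetical helper: ML_MCONF and ML_MB just above the number of training structures."""
    value = n_structures + margin
    return {"ML_MCONF": value, "ML_MB": value}


def memory_ratio(old: dict[str, int], new: dict[str, int]) -> float:
    """Design-matrix memory of new relative to old (rows ~ ML_MCONF, columns ~ ML_MB)."""
    return (new["ML_MCONF"] * new["ML_MB"]) / (old["ML_MCONF"] * old["ML_MB"])


old = {"ML_MCONF": 12000, "ML_MB": 12000}
new = suggested_limits(10000)
for tag, value in new.items():
    print(f"{tag} = {value}")  # INCAR-style lines
print(f"{memory_ratio(old, new):.2f}")  # 0.70, i.e. roughly 30% less
```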
"Total memory consumption" refers to the memory of the first core only. If you use more cores, you have to multiply it by the number of cores. From the OUTCAR I saw that you used 48 cores. I guess you have at least 8 cores per node, but either way this hugely exceeds your memory.
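The arithmetic for this job, as a Python sketch: multiply the first-core figure by the number of ranks on a node and compare with the node's RAM. It assumes "376 G" means 376 GiB (376*1024 MB) and that every rank needs about as much as the first one. The two cases shown are 48 ranks on one node (the RANK 1 and RANK 47 lines in the log both report node41, which suggests this) and the guessed 8 ranks per node. The first-core figure itself changes when the total rank count changes, so this only checks the run as submitted.

```python
import math

FIRST_CORE_MB = 131953.9   # "Total memory consumption" from the ML_LOGFILE
NODE_MB = 376 * 1024       # assumed: 376 GiB of RAM per node


def node_demand_mb(first_core_mb: float, ranks_on_node: int) -> float:
    """Rough memory demand on a node: per-core figure times ranks on that node."""
    return first_core_mb * ranks_on_node


def max_ranks_per_node(first_core_mb: float, node_mb: float) -> int:
    """Largest number of ranks whose combined demand still fits on one node."""
    return math.floor(node_mb / first_core_mb)


for ranks in (48, 8):
    print(ranks, round(node_demand_mb(FIRST_CORE_MB, ranks) / 1024), "GiB needed vs", NODE_MB // 1024)
print(max_ranks_per_node(FIRST_CORE_MB, NODE_MB))  # 2
```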
One way to make this calculation fit would be to use one core per node and go to multiple nodes. This way each core has the entire memory of its node (you don't even need shared memory then), and the design-matrix memory per node decreases roughly in inverse proportion to the number of nodes. Nevertheless, the calculation will take a very long time (several days), so be prepared for that.
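With one rank per node, the design matrix is split over the nodes while each node holds a single copy of the replicated arrays. A Python sketch for the minimum node count; min_nodes_one_rank_each is a hypothetical helper, and the usable fraction of RAM and all sizes in the example are placeholders to be replaced by the ML_LOGFILE figures (FMAT, CMAT, DESC) of your own run.

```python
import math


def min_nodes_one_rank_each(fmat_total: float, replicated: float, node_mem: float,
                            usable_fraction: float = 0.9) -> int:
    """Smallest node count so that fmat_total / n_nodes + replicated fits on a node."""
    budget = node_mem * usable_fraction - replicated
    if budget <= 0:
        raise ValueError("replicated arrays alone do not fit on one node")
    return math.ceil(fmat_total / budget)


# Hypothetical sizes in GiB: 850 GiB design matrix, 20 GiB replicated, 376 GiB nodes
print(min_nodes_one_rank_each(850.0, 20.0, 376.0))  # 3
```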
I saw that you set a value for CTIFOR in the ML_AB file for the structures. Is this because the forces were obtained from separate VASP machine-learning calculations and you now want to combine the files? If CTIFOR values are supplied for all structures in the file, these values are used to select local reference configurations when the structures are added to the database. Just be aware of that. Also, you cannot combine structures containing a CTIFOR value with structures without one! If none of the structures contains a CTIFOR value, then this algorithm is used:
wiki/index.php/Machine_learning_force_f ... _of_forces
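Before merging files, it may be worth checking that CTIFOR is present either for every structure or for none. A Python sketch, assuming you have already read each structure's CTIFOR value (None when absent) with your own parser; ctifor_mode is a hypothetical helper.

```python
from typing import Optional


def ctifor_mode(ctifor_values: list[Optional[float]]) -> str:
    """'all' if every structure has CTIFOR, 'none' if no structure has it.

    A mixture is rejected, because such structures cannot be combined.
    """
    present = [value is not None for value in ctifor_values]
    if present and all(present):
        return "all"
    if not any(present):
        return "none"
    missing = [i for i, has_value in enumerate(present) if not has_value]
    raise ValueError(f"CTIFOR missing for structures {missing}, present for others")


print(ctifor_mode([0.02, 0.03, 0.02]))  # all
print(ctifor_mode([None, None]))        # none
```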
Also beware: I saw that many of your atoms have zero forces, and providing 10000 structures with lots of zero forces is not a good idea.
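To see how widespread zero forces are in a data set, one can simply count them. A Python sketch, assuming the forces have already been collected per atom across all structures; zero_force_fraction is a hypothetical helper and the tolerance is an arbitrary choice.

```python
def zero_force_fraction(forces: list[tuple[float, float, float]], tol: float = 1e-8) -> float:
    """Fraction of atoms whose force components are all zero within tol."""
    if not forces:
        return 0.0
    zero = sum(1 for force in forces if all(abs(c) <= tol for c in force))
    return zero / len(forces)


# Hypothetical forces (eV/Angstrom) for four atoms
print(zero_force_fraction([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1e-9)]))  # 0.75
```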