how to manually generate the ML_AB file

#1 Post by **Qiuju Zhang** » Mon Feb 21, 2022 4:12 am

Dear all, is it possible to generate ML_AB files from already done conventional MD calculations for MLFF to learn?

#2 Post by **ferenc_karsai** » Mon Feb 21, 2022 8:25 am

Yes, it is possible, but this feature has not yet been thoroughly tested and is not fully supported.

If you still want to do it, do the following steps:

-) First prepare an ML_AB that looks like this:
wiki/index.php/ML_AB

Please mind that the order of the element types on line 13 ("The atom types in the data file") has to match the order in which they later appear in the configurations. For example, if the first system is Fe_xNi_y and the second is Co_xSi_y, this line would have to be written as

**************************************************
The atom types in the data file
--------------------------------------------------
Fe Ni Co Si

Please also mind that the reference atomic energies and atomic masses also depend on this order. In your case you can set the reference atomic energies to 0.

The Basis sets (local reference configurations) for the elements can be set to 1. This part is ignored but dummy values have to be set so that the reader works correctly.
So for example you would write:
**************************************************
The numbers of basis sets per atom type
--------------------------------------------------
1 1 1 1
**************************************************
Basis set for Fe
--------------------------------------------------
1 1
**************************************************
Basis set for Ni
--------------------------------------------------
1 1
**************************************************
Basis set for Co
--------------------------------------------------
1 1
**************************************************
Basis set for Si
--------------------------------------------------
1 1

Also very important: Training structures have to be properly grouped together and given unique names. That means training structures containing the same element types and the same number of atoms per element belong to the same group.
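
The ordering and grouping rules above can be sketched in a few lines of Python. This is only an illustration: representing a structure as a list of (element, count) pairs and the function names `atom_type_order` and `group_by_composition` are assumptions made here, not part of VASP, and the example data is hypothetical.

```python
from collections import OrderedDict


def atom_type_order(structures):
    """Return element symbols in order of first appearance across all structures."""
    seen = []
    for composition in structures:
        for element, _count in composition:
            if element not in seen:
                seen.append(element)
    return seen


def group_by_composition(structures):
    """Group structure indices that share the same elements and atom counts."""
    groups = OrderedDict()
    for index, composition in enumerate(structures):
        key = tuple(composition)
        groups.setdefault(key, []).append(index)
    return groups


# Hypothetical training set: two Fe-Ni structures followed by one Co-Si structure.
structures = [
    [("Fe", 4), ("Ni", 4)],
    [("Fe", 4), ("Ni", 4)],
    [("Co", 2), ("Si", 6)],
]

type_line = " ".join(atom_type_order(structures))  # text for "The atom types in the data file"
groups = group_by_composition(structures)          # one entry per structure group
print(type_line)
for composition, indices in groups.items():
    print(composition, indices)
```

Each key of `groups` corresponds to one group of training structures that should be written together under its own unique name.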

This strict ordering of structures and elements will be lifted in the next update, so that the user doesn't necessarily have to be so strict with naming and ordering. Nevertheless it's good practice to order the data correctly.

-) Second run a calculation using ML_ISTART=3:
wiki/index.php/ML_ISTART

This calculation loops over all existing training structures, reads them in one by one and mimics an on-the-fly training run. Its sole purpose is to select the local reference configurations that become part of the force field. Beware: this step can be quite time-consuming.

This step produces a new ML_AB file (written as ML_ABN) as well as an ML_FFN file that can be used.

-) Optionally you may want to refine your force field using the new ML_AB file. For that please have a look at this site:
wiki/index.php/Machine_learning_force_f ... rce_fields

#3 Post by **Qiuju Zhang** » Wed Feb 23, 2022 6:53 am

Dear Prof. Ferenc Karsai,

Thank you so much for your kind reply! According to your instructions, I manually built the ML_AB file. But in the second step (run a calculation using ML_ISTART=3), it seems to run into a memory problem:

 LDA part: xc-table for Pade appr. of Perdew
 Machine learning selected
 Setting communicators for machine learning
 Initializing machine learning
 Starting to select new local configurations from ML_AB file (ML_FF_ISTART=3):
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 140995 RUNNING AT node41
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
...
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 47 PID 141041 RUNNING AT node41
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
forrtl: severe (41): insufficient virtual memory
Image              PC                Routine            Line        Source
vasp_std           0000000001D586FB  Unknown               Unknown  Unknown
vasp_std           00000000005F3971  Unknown               Unknown  Unknown
vasp_std           00000000005EB217  Unknown               Unknown  Unknown
vasp_std           00000000005E81FA  Unknown               Unknown  Unknown
vasp_std           00000000006B80D6  Unknown               Unknown  Unknown
vasp_std           00000000006B15C6  Unknown               Unknown  Unknown
vasp_std           00000000006BB8FE  Unknown               Unknown  Unknown
vasp_std           00000000010529EA  Unknown               Unknown  Unknown
vasp_std           0000000001CA65CE  Unknown               Unknown  Unknown
vasp_std           000000000040DBA2  Unknown               Unknown  Unknown
libc-2.17.so       00002AAB57EF9555  __libc_start_main     Unknown  Unknown
vasp_std           000000000040DAA9  Unknown               Unknown  Unknown

Do you have any suggestions for this?

Best,
Pan

#4 Post by **ferenc_karsai** » Wed Feb 23, 2022 10:29 am

Training a force field is generally memory-intensive, especially if one has many training structures with many different element types.

Please provide some information about your job. How many training structures do you have, what is the number of types and what is the maximum number of atoms per structure? You can also upload your ML_AB file so that I can check it.
Did you compile with MPI shared memory (precompiler option -Duse_shmem)?

The largest matrix that needs to be stored is the design matrix.
Its dimension is number_of_training_structures*(3*N_atom+7)*local_reference_configurations. At the beginning of an ML_ISTART=3 run you don't know the number of local reference configurations, so you have to set an upper bound via ML_MB (I usually set it equal to the number of training structures, but this is system dependent; you may have to repeat the calculation afterwards). The maximum number of training structures also needs to be set, but that is easy to choose since you know how many training structures your ML_AB file contains. The design matrix is then statically allocated at the beginning of the calculation. Before the actual allocations are done, the estimated memory is printed to the ML_LOGFILE, so you can see how much memory you would need and whether it fits into what is available. The entry "FMAT for basis" is the memory required for the design matrix.
Please also read this wiki entry about the memory estimation in the ML_LOGFILE:
wiki/index.php/ML_LOGFILE#Memory_consumption_estimation

The design matrix is fully distributed in memory, so the more cores you use, the less memory it needs per core. Going to more nodes may therefore allow it to fit into memory.
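
As a rough back-of-the-envelope check, the size formula above can be evaluated before launching a job. The sketch below assumes 8 bytes per matrix element (double precision) and an even split of the design matrix over all cores; both are assumptions made here, the function names are hypothetical, and the numbers are example values. The authoritative estimate is the "FMAT for basis" entry in the ML_LOGFILE.

```python
def design_matrix_bytes(n_structures, n_atoms, n_local_refs, bytes_per_value=8):
    """Size of the design matrix following the formula quoted in the post.

    Each structure contributes 3*N_atom + 7 rows; there is one column per
    local reference configuration. bytes_per_value=8 is an assumption.
    """
    rows = n_structures * (3 * n_atoms + 7)
    return rows * n_local_refs * bytes_per_value


def per_core_gib(total_bytes, n_cores):
    """Memory per core in GiB, assuming an even distribution over all cores."""
    return total_bytes / n_cores / 1024**3


# Hypothetical job: 1000 structures of 100 atoms, upper bound of 1000 local references.
total = design_matrix_bytes(n_structures=1000, n_atoms=100, n_local_refs=1000)
for cores in (16, 64, 256):
    print(f"{cores:4d} cores: {per_core_gib(total, cores):.3f} GiB per core")
```

Because the matrix grows with the product of the number of structures and the number of local reference configurations, raising both limits together increases the memory roughly quadratically.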

Another very important point is shared memory:
The covariance matrix and parts of the descriptors need to be present on every core in their full size ("CMAT for basis" and "DESC for basis" in the ML_LOGFILE). If MPI shared memory is not used, these matrices are allocated on every core. With shared memory, they are allocated only once per node. Without shared memory, one can therefore be strongly limited by memory, so please check whether you use this capability.
Please also see this wiki entry on memory usage and shared memory:
wiki/index.php/Machine_learning_force_f ... mory_usage

#5 Post by **Qiuju Zhang** » Thu Feb 24, 2022 4:47 am

Dear Prof. Ferenc Karsai,

Thank you so much for your quick reply!

I have seen the Shared memory with MPI before, but I did not use it due to the potential risks on the cluster. Anyway, since you mention it, I will try it later. Here I would like to upload the ML_AB file first, which contains a total of 10,000 configurations with 374 atoms each.

BTW, I ran the calculation with "ML_MCONF = 12000; ML_MB = 12000". In the ML_LOGFILE file, "Total memory consumption" is 131953.9 MB (ca. 129 GB). The compute node has a total of 376 GB of RAM, but the calculation still failed with a memory error, which I cannot understand.

Best,
Pan

PS: since the full ML_AB file is too large (86.7 MB after compression) and exceeds the upload file size limit, I kept only a few configurations in the attached ML_AB.

testML.tgz

#6 Post by **ferenc_karsai** » Fri Feb 25, 2022 8:16 am

OK, 10000 training structures is really huge for kernel methods. Setting ML_MCONF to 12000 is absolutely unnecessary; just set it to 10050. The same goes for ML_MB.
"Total memory consumption" means the total memory for the first core. If you use more cores, you have to multiply that by the number of cores. I saw in the OUTCAR file that you used 48 cores. I guess you have at least 8 cores per node, but either way this hugely exceeds your memory.
One way to fit this calculation would be to use one core per node and go to multiple nodes. This way each core has the entire memory of its node (you then don't even need shared memory), and the memory demand per node for the design matrix would decrease roughly in proportion to the number of nodes. Nevertheless, the calculation will take really long (several days), so be prepared for that.
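
The arithmetic behind this can be written out with the numbers quoted in the thread. The sketch assumes that every core needs about as much memory as the first core, that all 48 cores ran on a single node (the error log lists ranks 1 and 47 both on node41), and that 1 GB is counted as 1024 MB; the variable names are only illustrative.

```python
per_core_mb = 131953.9  # "Total memory consumption" from the ML_LOGFILE (first core)
n_cores = 48            # cores used, as read from the OUTCAR
node_ram_gb = 376       # RAM of the compute node

# Assumption: every core needs roughly as much memory as the first one.
total_gb = per_core_mb * n_cores / 1024
ratio = total_gb / node_ram_gb

print(f"estimated total demand: {total_gb:.0f} GB")
print(f"available on the node:  {node_ram_gb} GB")
print(f"demand / available:     {ratio:.1f}x")
```

Under these assumptions the estimated demand is more than an order of magnitude above the available RAM, which is consistent with the processes being killed right after the allocation started.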

I saw you set a value for CTIFOR in the ML_AB file for the structures. Is this because the forces were obtained from separate VASP machine learning calculations before and now you want to combine the files? If CTIFOR values are supplied everywhere in the file, then this CTIFOR value is used to select local reference configurations when these structures are added to the database. Just be aware of that. Also, you cannot combine structures that contain a CTIFOR value with structures that do not! If none of the structures contains a CTIFOR value, then this algorithm is used:
wiki/index.php/Machine_learning_force_f ... _of_forces

Also be aware that many of your atoms have zero forces, as far as I saw, so providing 10000 structures with a lot of zero forces is not such a good thing.
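
A quick way to check for this before building the ML_AB file is to count the atoms whose force vector is zero. The sketch below is only an illustration: it assumes the forces of one structure are already available as a list of (fx, fy, fz) tuples (reading them from VASP output is not shown), and the function name, tolerance and sample data are hypothetical.

```python
def zero_force_fraction(forces, tol=1e-8):
    """Fraction of atoms whose force components are all within tol of zero."""
    if not forces:
        return 0.0
    n_zero = sum(1 for f in forces if all(abs(c) <= tol for c in f))
    return n_zero / len(forces)


# Hypothetical forces for a four-atom structure; two atoms carry no force.
forces = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, -0.2),
    (0.0, 0.0, 0.0),
    (0.0, 0.05, 0.0),
]
fraction = zero_force_fraction(forces)
print(f"{fraction:.0%} of the atoms have zero force")
```

Atoms that were held fixed during the original MD are one possible source of exactly vanishing forces.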
