Problems with vasp on hpc machines with high load on the file server
Posted: Thu Jan 18, 2024 1:27 pm
Dear VASP-Community,
During the last few weeks I have run into some problems with my VASP calculations. I am using VASP version 6.3.2.
I am investigating tungsten trioxide clusters, so I have mainly used the gamma-only version vasp_gam so far. But as soon as many people are using the cluster and there is moderate or high load on the file server connected to the HPC machine, a large fraction of my calculations slows down significantly at a random point. Most of the time these jobs cannot finish within 24 h, even though they should. When I restart such a calculation during a period of low load on the file server, the very same job finishes within a few hours. Looking at the cluster cockpit, I noticed that for these jobs the Gflops drop to zero as soon as the problem occurs, even though the CPU load stays constant. This matches the observation that the calculation simply seems to stop and wait.
I attached a corresponding INCAR, POSCAR, submit.sh, and the vasp.log and vaspout.h5 output files, together with the error message (.err) and the cluster-related output (.out). You can find these files in "random_slow_down.tar.gz" in the attachments. Since I have observed this problem for many different settings and configurations, I doubt it is tied to this specific input. I additionally attached the CPU-load, memory, and Gflops data from the cluster cockpit in "random_slow_down.tar.gz"; I used 4 nodes and each color corresponds to the data of one node. If any further data regarding the job is missing, please don't hesitate to contact me.
I observed this problem earlier, when I was still using VASP version 5.4.4.pl2, so I don't think it is tied to one specific version either.
I also tried the standard version vasp_std instead of vasp_gam. With this version the sudden breakdown of the Gflops indeed seemed to become less likely, although it still occurred in some of my jobs. However, using the standard version led to another problem: almost all of my calculations run during high load on the file server aborted at a random point with the error message:
"internal error in: vhdf5.F at line: 394
HDF5 call in vhdf5.F:394 produced error: -1 "
This problem is also non-deterministic: rerunning the same calculation does not necessarily reproduce the error. I attached a corresponding INCAR, POSCAR, submit.sh, and the VASP output, together with the full error message (.err) and the cluster-related output (.out), in "error_in_hdf5.tar.gz", along with the corresponding data from the cluster cockpit. If you need more information about the job, I would be happy to send it to you. This error also occasionally occurs with vasp_gam, but it became much more frequent with vasp_std.
Since these problems seem to be connected to I/O, is there a way to make it easier for VASP to store its output, especially the HDF5 file? Is there, for example, a way to choose the path where vaspout.h5 is written?
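In the meantime, the only workaround I can think of is to stage the whole run to node-local scratch in the submit script, so that the shared file server is only touched when copying files at the beginning and end of the job. Below is a minimal sketch of what I mean; the SLURM directives and the $TMPDIR scratch location are just assumptions based on my setup and will differ between clusters.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=24:00:00

# Node-local scratch directory; on our machine $TMPDIR points to local disk
# (this path is an assumption and differs between clusters).
SCRATCH="${TMPDIR:-/tmp}/vasp_${SLURM_JOB_ID}"

# Stage the input files to the node-local scratch directory.
mkdir -p "$SCRATCH"
cp "$SLURM_SUBMIT_DIR"/{INCAR,POSCAR,KPOINTS,POTCAR} "$SCRATCH"
cd "$SCRATCH"

# Run VASP inside the scratch directory, so vaspout.h5 and the other
# output files are written to local disk instead of the file server.
srun vasp_std > vasp.log

# Copy the results back to the shared file system at the end of the job.
cp -r "$SCRATCH"/. "$SLURM_SUBMIT_DIR"

However, I am not sure this is safe for a multi-node job like mine, since $TMPDIR is local to each node and I don't know whether VASP writes all of its output from the first node only. That is why an option inside VASP itself to redirect the HDF5 output would be much more convenient.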
Best regards,
Emmi Gareis