MPI error at large NBANDS
Posted: Sun Mar 29, 2020 11:00 pm
Hello,
I have been using VASP 5.4.4 for a while; it is compiled on our university cluster against OpenMPI 2.1.2, and the cluster runs the SGE job management system.
I have noticed that VASP reliably crashes for certain combinations of NPAR, NCORE and NBANDS. There appears to be a maximum allowed NBANDS for each combination of NPAR and NCORE. For example, with NCORE=4 and NPAR=6, the highest working NBANDS is 96: any NBANDS up to and including 96 runs smoothly, while any value above 96 makes the calculation abort with an MPI error. This is very consistent and can be reproduced 100% of the time. The error messages are shown at the end of the post.
Because NPAR and NCORE should multiply to the total number of cores, there is only a finite set of NPAR/NCORE combinations for a given core count. Each of these combinations has a highest allowed NBANDS, beyond which the crash occurs reliably.
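To make the combinations concrete, here is a small Python sketch of my own (the 24-core total is simply NCORE*NPAR from the example above; the ceiling of 96 is the one value I quoted, and ceilings exist for the other combinations as well, I am just not listing them here):

# My own illustration: enumerate the finite set of NCORE/NPAR combinations
# for a fixed core count, taking the 24-core case implied by NCORE=4 * NPAR=6.

TOTAL_CORES = 24  # assumed from the NCORE=4, NPAR=6 example above

# Empirical NBANDS ceiling I measured for one combination; the other
# combinations have their own ceilings, which are not listed here.
observed_max_nbands = {(4, 6): 96}

for ncore in range(1, TOTAL_CORES + 1):
    if TOTAL_CORES % ncore == 0:
        npar = TOTAL_CORES // ncore
        ceiling = observed_max_nbands.get((ncore, npar), "not listed here")
        print(f"NCORE={ncore:2d}  NPAR={npar:2d}  max working NBANDS: {ceiling}")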
I have tested and ruled out the k-mesh, KPAR, the SGE system, ALGO (Davidson vs. steepest descent), PREC, NSIM, ENCUT, the number of atoms, and the memory size: none of them are part of the problem.
This bug is usually not a problem for small unit cells, since few bands are needed, but it is stopping me from performing some supercell calculations, because more atoms mean more electrons and therefore more bands.
Since this bug, which is obvious on my end, does not seem to have been reported here, I suspect it has to do with how VASP was compiled with OpenMPI on our cluster. I consulted our cluster IT support, but they had no idea either.
Thanks,
Yueguang Shi
*This is the second time I am making this post; the first one apparently did not go through, as I forgot to attach the required zip file. My apologies if the first post did go through and this turns out to be a duplicate.
Below is a set of sample outputs from the MPI bug; note that the exact error messages vary somewhat from run to run.
*********************************************************************************************************************************************
stdout and stderr of SGE system:
[machine:56705] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[machine:56705] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
*********************************************************************************************************************************************
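(For reference, the aggregated message above can be expanded by passing the suggested MCA parameter on the launch line, e.g. mpirun --mca orte_base_help_aggregate 0 ..., which should print each rank's error individually; our jobs go through the SGE wrapper, so treat that invocation as a sketch rather than the exact command I use.)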
end of VASP stdout:
entering main loop
N E dE d eps ncg rms rms(c)
[machine:56715] *** An error occurred in MPI_Bcast
[machine:56715] *** reported by process [995426305,47163035877381]
[machine:56715] *** on communicator MPI COMMUNICATOR 14 SPLIT FROM 12
[machine:56715] *** MPI_ERR_TRUNCATE: message truncated
[machine:56715] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[machine:56715] *** and potentially your MPI job)
*********************************************************************************************************************************************
end of VASP OUTCAR:
--------------------------------------- Iteration 1( 1) ---------------------------------------
POTLOK: cpu time 0.9113: real time 0.9166
SETDIJ: cpu time 0.0106: real time 0.0106
*********************************************************************************************************************************************
*The files attached as bug_test.zip are disguised versions of my real calculations/inputs, but they should still show the essence of the MPI bug. I have been working on these calculations for a while, and they definitely run perfectly well as long as the MPI error does not occur.