
VASP 6.2.0 -- HSE06 Out of Memory Bug

Posted: Sun Jul 24, 2022 1:50 am
by graham_pritchard
When running an HSE06 calculation with VASP 6.2.0 (vasp/6.2.0-openmpi-4.0.5-intel-19.0.5.281-cuda-11.2.1), I received the following error:

Code:

 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mkpoints_full.F  at line: 1161                       |
|                                                                             |
|     internal error in SET_INDPW_FULL: insufficient memory (see wave.F       |
|     safeguard) 189 188                                                      |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------
If I increase the number of cores requested (and thus also the total memory), the HSE06 calculation runs successfully. However, both the OUTCAR file and the seff job report show that very little memory was used when the job completes. For example, the following appeared at the end of a successful HSE06 run after requesting 8 nodes with 27 cores each at 3 GB per node.

total amount of memory used by VASP MPI-rank0 39449. kBytes
=======================================================================

   base       :  30000. kBytes
   nonl-proj  :    706. kBytes
   fftplans   :    448. kBytes
   grid       :    470. kBytes
   one-center :     15. kBytes
   HF         :     64. kBytes
   nonlr-proj :   2844. kBytes
   wavefun    :    331. kBytes
   fock_wrk   :   4571. kBytes
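As a quick sanity check (my own arithmetic, not part of the OUTCAR), the per-category figures quoted above do sum to the reported rank-0 total, so the ~39 MB figure is internally consistent:

```python
# Per-category memory figures from the OUTCAR excerpt above, in kBytes.
# Keys mirror the OUTCAR labels.
breakdown = {
    "base": 30000, "nonl-proj": 706, "fftplans": 448, "grid": 470,
    "one-center": 15, "HF": 64, "nonlr-proj": 2844, "wavefun": 331,
    "fock_wrk": 4571,
}
total_kb = sum(breakdown.values())
print(total_kb)  # 39449, matching "memory used by VASP MPI-rank0"
```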

I have attached the relevant job files for this bug, including the OUTCAR for the successful run.

Re: VASP 6.2.0 -- HSE06 Out of Memory Bug

Posted: Mon Jul 25, 2022 7:39 pm
by graham_pritchard
The bug appears to be associated with the NCORE setting. If I set NCORE equal to the number of cores requested per node, the run crashes with the previously reported error. I tried requesting 24 and 27 cores per node, and both runs failed when I set NCORE = 24 and 27, respectively. I then ran a number of additional calculations with the following settings:

KPAR  NCORE  Cores/Node  Nodes  Calculation runs
 12     1        24        1         Yes
 12     2        24        1         Yes
  1     1        24        1         Yes
  1     6        24        1         Yes
  1    12        24        1         Yes
  1    24        24        1         NO
  1    27        27        4         NO
  1    27        27        8         Yes

Every calculation runs when NCORE is less than the number of cores requested per node, even when I reduce the number of nodes down to 1 (instead of the 8 nodes originally required for success).
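The pattern in the table can be stated as a simple predicate (my own summary of these observations, not an official VASP rule), and checked against the runs:

```python
# Each entry is (KPAR, NCORE, cores_per_node, nodes, ran_ok),
# transcribed from the table above.
runs = [
    (12,  1, 24, 1, True), (12,  2, 24, 1, True), (1,  1, 24, 1, True),
    ( 1,  6, 24, 1, True), ( 1, 12, 24, 1, True), (1, 24, 24, 1, False),
    ( 1, 27, 27, 4, False), (1, 27, 27, 8, True),
]

# Observed rule of thumb: every run with NCORE < cores-per-node succeeded.
# (The converse does not hold: the 8-node run succeeded even with
# NCORE == cores-per-node, presumably because of the extra memory.)
ok = all(ran for (_, ncore, cpn, _, ran) in runs if ncore < cpn)
print(ok)  # True
```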

Re: VASP 6.2.0 -- HSE06 Out of Memory Bug

Posted: Wed Jul 27, 2022 5:57 am
by henrique_miranda
This error arises from a complex interplay between the number of plane-wave coefficients at each k-point, the expansion from the irreducible Brillouin zone to the full Brillouin zone, and the plane-wave distribution when NCORE /= 1 in hybrid-functional calculations (LHFCALC = .TRUE.).
The error message tries to convey that the number of plane waves to be communicated between nodes is not the same everywhere, so the allocations on some nodes are not large enough to hold the plane waves arriving from another node. It is not a problem of the node actually running out of memory.
We are working on a solution for this.

In the meantime there are a few workarounds that you might try:
1. Use a smaller value of NCORE. As a rule of thumb, NCORE should always be smaller than the number of cores on each node, because at each FFT the coefficients have to be communicated between NCORE processes. Communication between nodes is much slower than intra-node communication, so setting NCORE too large often degrades performance (you can probably see this in the timings of your calculations).
2. Support for NCORE when LHFCALC = .TRUE. is still not extensively tested. If for some reason you cannot find a value of NCORE that works, we suggest deactivating it altogether (i.e. set NCORE = 1).
3. You might be able to avoid this issue by choosing a slightly different k-point mesh. For example, in my testing this works:

Code:

KPOINTS
0
Gamma
  4   3   6
0.5 0.5 0.5
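For reference, workaround 2 amounts to an INCAR fragment along these lines (a sketch only; your other tags stay as they are, and the HSE06 settings shown are the standard LHFCALC/HFSCREEN combination rather than anything from this thread):

```
LHFCALC  = .TRUE.   ! hybrid functional; HSE06 additionally uses HFSCREEN = 0.2
HFSCREEN = 0.2
NCORE    = 1        ! workaround 2: no plane-wave distribution over cores
```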

Hope this helps

Re: VASP 6.2.0 -- HSE06 Out of Memory Bug

Posted: Fri Jul 29, 2022 3:22 pm
by graham_pritchard
Thank you for explaining what was going on, Henrique; this helps a lot.