Segmentation fault in MLWF_TRAFO_RUN of mlwf.F
Posted: Thu May 27, 2021 11:33 am
Dear VASP developers,
I’d like to report an issue in VASP 6.2.0 with Wannier90 support.
Following “Makefile.include nv acc+omp+mkl” in the VASP Manual, the VASP executables were compiled using NVHPC-SDK 20.9, openmpi-3.1.4 and Intel MKL 21.2.0 with the -DVASP2WANNIER90 option.
To perform an HSE06 calculation for a large system, we use 8 nodes (each with 4 Tesla P100 for NVLink-Optimized Servers and 2 Xeon E5-2680 v4) with NCCL and GPUDirect RDMA support.
When LWANNIER90 is set to TRUE in INCAR, a segmentation fault occurs during the Wannier projection step (the one that prints “Projection [***/***] done”), and some compute nodes then go down.
I found that this issue is caused by an out-of-bounds array reference at line 1237 of mlwf.F:
PROJECTIONS(:,IK,IS,IW) = MLWF%A_matrix(:,IW,IK,IS)
As shown in lines 972 and 1229, the first dimension of PROJECTIONS (MLWF%NB_TOT) is larger than that of MLWF%A_matrix (MLWF%NB_TOT-MLWF%NEXCLB) by MLWF%NEXCLB, so the two array sections in the assignment do not conform.
Line 972:
ALLOCATE(MLWF%A_matrix(MLWF%NB_TOT-MLWF%NEXCLB,MLWF%NUM_WANN,MLWF%NKPTS,MLWF%ISPIN))
Line 1229:
ALLOCATE(PROJECTIONS(MLWF%NB_TOT,MLWF%NKPTS,MLWF%ISPIN,MLWF%NUM_WANN))
The following patch may be applied to solve this problem.
--- a/src/mlwf.F 2021-01-18 21:50:58.000000000 +0900
+++ b/src/mlwf.F 2021-05-26 22:21:39.000000000 +0900
@@ -1231,10 +1231,11 @@ MODULE mlwf
ALLOCATE(MLWF%lwindow(MLWF%NB_TOT,MLWF%NKPTS,MLWF%ISPIN))
MLWF%U_matrix = CMPLX(0.0_q,0.0_q)
MLWF%lwindow = .TRUE.
+ PROJECTIONS = CMPLX(0.0_q,0.0_q)
DO IS=1,MLWF%ISPIN
DO IK=1,MLWF%NKPTS
DO IW=1,MLWF%NUM_WANN
- PROJECTIONS(:,IK,IS,IW) = MLWF%A_matrix(:,IW,IK,IS)
+ PROJECTIONS(1:SIZE(MLWF%A_matrix,1),IK,IS,IW) = MLWF%A_matrix(:,IW,IK,IS)
MLWF%U_matrix(IW,IW,IK,IS) = CMPLX(1.0_q,0.0_q,KIND=q)
ENDDO
ENDDO
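For reference, the shape mismatch and the effect of the patch can be reproduced outside VASP with a small standalone program. This is only a minimal sketch: the sizes nb_tot, nexclb, num_wann, nkpts and ispin are hypothetical stand-ins for the corresponding MLWF members, the kind q is assumed to be double precision, and the arrays are plain allocatables rather than the real MLWF type.

```fortran
program projections_shape_demo
  implicit none
  ! Assumption: q is a double-precision kind, as is usual in VASP sources.
  integer, parameter :: q = selected_real_kind(10)
  ! Hypothetical small sizes standing in for MLWF%NB_TOT, MLWF%NEXCLB, ...
  integer, parameter :: nb_tot = 6, nexclb = 2
  integer, parameter :: num_wann = 3, nkpts = 2, ispin = 1
  complex(q), allocatable :: a_matrix(:,:,:,:), projections(:,:,:,:)
  integer :: is, ik, iw, nb

  ! Same shapes as the ALLOCATE statements quoted from lines 972 and 1229.
  allocate(a_matrix(nb_tot-nexclb, num_wann, nkpts, ispin))
  allocate(projections(nb_tot, nkpts, ispin, num_wann))
  a_matrix = cmplx(1.0_q, 0.0_q, kind=q)

  print *, 'size(projections,1) =', size(projections,1)
  print *, 'size(a_matrix,1)    =', size(a_matrix,1)

  ! Unpatched form, left commented out: the two sections have different
  ! extents whenever nexclb > 0, so the assignment is non-conforming.
  ! projections(:,ik,is,iw) = a_matrix(:,iw,ik,is)

  ! Patched form: zero-fill first, then copy only the rows that exist
  ! in a_matrix; rows nb+1..nb_tot of projections stay zero.
  projections = cmplx(0.0_q, 0.0_q, kind=q)
  nb = size(a_matrix,1)
  do is = 1, ispin
    do ik = 1, nkpts
      do iw = 1, num_wann
        projections(1:nb,ik,is,iw) = a_matrix(:,iw,ik,is)
      end do
    end do
  end do

  print *, 'rows copied :', count(projections(:,1,1,1) /= cmplx(0.0_q, 0.0_q, kind=q))
  print *, 'rows padded :', count(projections(:,1,1,1) == cmplx(0.0_q, 0.0_q, kind=q))
end program projections_shape_demo
```

Whether zero-padding the excluded bands is what the rest of MLWF_TRAFO_RUN expects is for the developers to confirm; the sketch only shows that the patched assignment stays within the bounds of both arrays.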
This issue still seems to remain in VASP 6.2.1.