VASP- GPU fails to converge


#1 Post by **scanmat_centre** » Thu Jun 16, 2022 5:25 am

We are using the GPU version of VASP for hybrid-functional calculations and are getting the following error:

Device Memory Info:
Total: 16276.2 MB
Free: 1.2 MB
Used: 16275.0 MB
Requested: 1.9 MB

CUDA Error in cuda_mem.cu, line 179: out of memory
Failed to allocate device memory!

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 55048 RUNNING AT scanmatdgx1
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

However, we believe the device has enough memory.
How should we proceed?

#2 Post by **martin.schlipf** » Fri Jun 17, 2022 6:56 am

Can you provide a bit more information about how you run these calculations? Did you try smaller systems successfully and are now running this larger calculation that fails, or do you get the same error for any case you use?
In the former case, how do you know that you have enough memory? What specifically did you compare against what?
In the latter case, could you provide the input files for the calculations you run?
Either way, can you also tell me which version of VASP you are using and whether you use the deprecated CUDA port or the OpenACC version?

#3 Post by **scanmat_centre** » Sat Jun 25, 2022 6:23 am

Yes, smaller systems ran successfully. It fails only for supercells.
I am using VASP 5.4.1,
and my input files are as follows:

INCAR
System = z

!Start parameters for this run:

ISTART = 1 !0 Start job: 1 restart constant energy cut-off 2 restart constant basis set
PREC = Accurate
LWAVE= .TRUE.
LREAL = TRUE !
!!Electronic relaxation :
EDIFF = 1E-6 ! accuracy required 1E-6
NELMIN = 5 !no of ELM steps !
LORBIT = 11
!!Ionic relaxation:
ENCUT = 400
ISMEAR = 0
SIGMA = 0.01
EDIFFG= -0.01
#GGA = PE
LHFCALC = .TRUE.
HFSCREEN = 0.2
PRECREEN = Fast
AEXX = 0.25
ALGO = All
LVDW= TRUE
IVDW = 1
NBANDS= 100

script

#!/bin/bash
#SBATCH --job-name=12.5Sbnd
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --distribution=cyclic:cyclic
#SBATCH --time=420:00:00
#SBATCH --mem-per-cpu=8000
##SBATCH --mail-type=END,FAIL
##SBATCH --mail-user=email@ufl.edu
#SBATCH --partition=debug
#SBATCH --gres=gpu:1
date;hostname;pwd

ulimit -s unlimited
ulimit -l unlimited
ulimit -m unlimited

pwd; hostname; date |tee result
# Setting some variables

module load vasp
module load CUDA/9.0

#for i in 15 16; do
# echo "n = $i"

WORK=$SLURM_SUBMIT_DIR

echo $WORK

# making scratch directory
SCRATCH=/home/${USER}/example/${SLURM_JOBID}
echo ${SCRATCH}
mkdir -p $SCRATCH/test
RUN=$SCRATCH/test

# Goto run dir
cd $RUN

# Copy input files to common scratch
cp $WORK/INCAR_bnd $RUN/INCAR
cp $WORK/CONTCAR-opt $RUN/POSCAR
cp $WORK/POTCAR $RUN
cp $WORK/IBZKPT-bnd $RUN/KPOINTS
#cp $WORK/WAVECAR $RUN/WAVECAR
#cp $WORK/CHGCAR $RUN/CHGCAR
ls -ltr

mpirun vasp_gpu
#mpirun vasp_std

cp OUTCAR $WORK/OUTCAR-band
cp CONTCAR $WORK/CONTCAR-band
cp DOSCAR $WORK/DOSCAR
cp PROCAR $WORK/PROCAR
cp EIGENVAL $WORK/EIGENVAL
cp vasprun.xml $WORK/vasprunband.xml

cd $WORK
rm -rf $RUN

#done

#4 Post by **martin.schlipf** » Mon Jun 27, 2022 9:36 am

I'm still not sure how you judge that you have enough memory. It seems that you would like to do band structure calculations. In this case, you can reduce the memory demand by splitting the calculation into multiple subparts or by using fewer points per line.
Unfortunately, I cannot provide more specific advice for your system, because the old CUDA port is no longer maintained. If you can reproduce this behavior with the OpenACC version, we would need to look into it more carefully.
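
To illustrate what splitting into subparts could look like, here is a minimal Python sketch. It assumes the KPOINTS file is an explicit list in which the regular mesh points carry nonzero weights and the band-path points carry zero weight, which is the usual setup for hybrid band structures. The function name `split_band_kpoints` is hypothetical and not part of VASP. Each part would be run as a separate job, and the eigenvalues of the zero-weight points collected afterwards.

```python
def split_band_kpoints(kpoints_text, chunk_size):
    """Split an explicit-list KPOINTS file into several smaller ones.

    Assumed layout: line 1 comment, line 2 number of k-points,
    line 3 coordinate mode, then one 'kx ky kz weight' line per point.
    Points with nonzero weight (the regular mesh) are repeated in every
    part; zero-weight points (the band path) are divided into chunks.
    """
    lines = kpoints_text.splitlines()
    comment, mode = lines[0], lines[2]
    declared = int(lines[1].split()[0])
    # Only read the declared number of points; anything after is ignored.
    points = lines[3:3 + declared]

    weighted = [p for p in points if float(p.split()[3]) != 0.0]
    path = [p for p in points if float(p.split()[3]) == 0.0]

    parts = []
    for start in range(0, len(path), chunk_size):
        body = weighted + path[start:start + chunk_size]
        parts.append("\n".join([comment, str(len(body)), mode] + body) + "\n")
    return parts


if __name__ == "__main__":
    # Tiny made-up example: 2 mesh points and 3 path points.
    example = (
        "mesh plus band path\n"
        "5\n"
        "Reciprocal\n"
        "0.0 0.0 0.0 1\n"
        "0.5 0.0 0.0 2\n"
        "0.1 0.0 0.0 0\n"
        "0.2 0.0 0.0 0\n"
        "0.3 0.0 0.0 0\n"
    )
    for i, part in enumerate(split_band_kpoints(example, chunk_size=2)):
        print(f"--- KPOINTS part {i} ---")
        print(part, end="")
```

Every part still contains the full weighted mesh, so only the number of path points per job shrinks. How small the chunks need to be depends on the system and has to be found by trial.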

#5 Post by **scanmat_centre** » Mon Jun 27, 2022 2:07 pm

The memory I am talking about is the memory of the device, I mean the supercomputer on which we are running the calculation.
Is my assumption about the memory wrong?
Or is there another measure I have to consider?

#6 Post by **martin.schlipf** » Mon Jun 27, 2022 3:33 pm

Well, there are two parts to the comparison: the memory available on the device and the memory that VASP needs to perform the calculation.
In particular for band structure calculations, the memory requirement can be quite a bit larger than for the self-consistency calculation, because the number of k-points is often larger.
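
A rough way to see the k-point scaling is to estimate the storage for the orbital coefficients alone. The sketch below assumes double-precision complex coefficients (16 bytes each) and uses made-up k-point and plane-wave counts. Only NBANDS = 100 is taken from the INCAR above. The function name is hypothetical, and the real footprint of a hybrid calculation is larger than this lower bound because of FFT grids and work arrays.

```python
def wavefunction_memory_mb(n_kpoints, n_bands, n_planewaves,
                           n_spin=1, bytes_per_coeff=16):
    """Lower-bound estimate (in MB) of the orbital-coefficient storage.

    Assumes one complex double (16 bytes) per plane-wave coefficient.
    This ignores FFT grids, projectors, and exchange work arrays, so
    the actual requirement is higher.
    """
    total_bytes = (n_kpoints * n_bands * n_planewaves
                   * n_spin * bytes_per_coeff)
    return total_bytes / 1024**2


if __name__ == "__main__":
    # Hypothetical numbers: 10 mesh points for the self-consistent run,
    # 10 mesh points + 50 path points for the band structure run.
    # The plane-wave count per k-point is also an assumption; the actual
    # value is listed per k-point in the OUTCAR.
    scf = wavefunction_memory_mb(n_kpoints=10, n_bands=100, n_planewaves=20000)
    band = wavefunction_memory_mb(n_kpoints=60, n_bands=100, n_planewaves=20000)
    print(f"mesh only:   {scf:.0f} MB")
    print(f"mesh + path: {band:.0f} MB")
```

The estimate grows linearly with the number of k-points, so adding a long band path to the mesh multiplies this part of the memory accordingly.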

Then again, I don't know how efficient the hybrid functional in the old CUDA port was. This part was worked on a lot in the OpenACC port to enhance the performance on one or more GPUs.

#7 Post by **scanmat_centre** » Tue Jun 28, 2022 4:53 am

Is there any way I can reduce the memory that VASP requires for the calculation?
Is there anything I need to change in the script?

The error looks like this:

Device Memory Info:
Total: 16276.2 MB
Free: 1.2 MB
Used: 16275.0 MB
Requested: 1.9 MB

#8 Post by **martin.schlipf** » Tue Jun 28, 2022 6:26 am

scanmat_centre wrote: ↑Tue Jun 28, 2022 4:53 am Is there any way I can modify the memory requirement for vasp to perform the calculation?
Anything I have to do with the script.. ?

Smaller energy cutoffs, fewer k-points, PREC = Normal

Of course you need to test whether this affects your results.
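
As a back-of-the-envelope illustration of why a smaller cutoff helps: for a fixed cell, the number of plane waves scales roughly as ENCUT^(3/2), since the cutoff defines a sphere in reciprocal space whose radius goes as the square root of the energy. The sketch below uses a hypothetical reduction from 400 eV to 300 eV; the function name is made up for this example.

```python
def planewave_ratio(encut_new, encut_old):
    """Approximate ratio of plane-wave counts for two cutoff energies.

    Uses the free-electron estimate N_pw ~ ENCUT**1.5 at fixed cell
    volume; the exact count per k-point deviates slightly from this.
    """
    return (encut_new / encut_old) ** 1.5


if __name__ == "__main__":
    # Hypothetical example: lowering ENCUT from 400 eV to 300 eV.
    ratio = planewave_ratio(300, 400)
    print(f"plane waves relative to ENCUT = 400: {ratio:.2f}")
```

This only covers the basis-set size. The FFT grids selected by PREC shrink as well when going from Accurate to Normal, and both changes can alter the results, so convergence tests are still needed.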

#9 Post by **scanmat_centre** » Wed Jun 29, 2022 5:10 am

I will check with them.

#10 Post by **scanmat_centre** » Wed Jul 06, 2022 8:13 am

Thank you, it's working now.
I have reduced ENCUT and changed PREC from Accurate to Normal.
