Using VASP 6.2 OpenACC GPU port

Queries about input and output files, running specific calculations, etc.


david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Using VASP 6.2 OpenACC GPU port

#1 Post by david_keller » Fri Jan 29, 2021 8:04 pm

After https://www.vasp.at/forum/viewtopic.php?f=2&t=18020
I ended up being able to compile and link successfully!
Thanks for all the help!

(BTW - I linked against Intel's MKL BLAS, ScaLAPACK, and FFTW libraries.
Do any of these libraries make use of GPUs via OpenACC in the PGI-compiled versions?)

Now, when I try to run the test suite, the very first job fails with:


VASP_TESTSUITE_RUN_FAST="Y"

Executed at: 14_54_01/29/21
==================================================================

------------------------------------------------------------------

CASE: andersen_nve
------------------------------------------------------------------
CASE: andersen_nve
entering run_recipe andersen_nve
andersen_nve step STD
------------------------------------------------------------------
andersen_nve step STD
entering run_vasp_g
 running on    4 total cores
 distrk:  each k-point on    2 cores,    2 groups
 distr:  one band on    1 cores,    2 groups
 OpenACC runtime initialized ...    1 GPUs detected
 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: VASP 6.2

#2 Post by david_keller » Mon Feb 01, 2021 7:14 pm

Is it possible to build and run an OpenACC version with OpenMP enabled as well?
How can I best make use of a server with 1 GPU and 40 CPUs?
If I were to get 3 more GPUs on my server, how would I best run in terms of number of processes, number of threads, and number of GPUs?
Also, I may have already submitted a similar reply, but I cannot see whether it is waiting in a review queue somewhere.
Again,
Thanks for all your help.

mmarsman
Global Moderator
Posts: 12
Joined: Wed Nov 06, 2019 8:44 am

Using VASP 6.2 OpenACC GPU port

#3 Post by mmarsman » Tue Feb 02, 2021 11:01 am

Hi David,

The NCCL error message you encounter most likely means that you started VASP with more MPI ranks than the number of GPUs you have available.
Unfortunately the current versions of NCCL do not allow MPI ranks to share a GPU, so you are forced to use one MPI rank per GPU.
This information still needs to go onto our wiki (I'm currently working on it, sorry for the delay and the inconvenience this caused you!).
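
For example, a minimal sketch of what that means for the launch line (assuming Open MPI and an executable named vasp_std; adjust to your own setup):

# 1 GPU on the node  -> start only 1 MPI rank
mpirun -np 1 vasp_std
# 4 GPUs on the node -> 4 MPI ranks, one per GPU
mpirun -np 4 vasp_std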

In case you have a CPU with 40 cores (threads?) and only one GPU, that is indeed a bit unfortunate.
Getting 3 more GPUs would of course change this somewhat (in addition to adding a whopping amount of compute power GPU-side).
Another option, which you already allude to yourself: you can add OpenMP into the mix as well.
Yes, it is indeed a good idea to build the OpenACC version with OpenMP support.
In that case each MPI rank may spawn a few threads, and you'll get better CPU usage in those code paths that still remain CPU-side.
I will add a "makefile.include" template for OpenACC+OpenMP to the wiki ASAP.
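
With such a build, a launch on your 1-GPU/40-core node could look roughly like this (a sketch only; the thread count is just an illustration, and -x assumes Open MPI):

# one MPI rank drives the GPU, a few OpenMP threads pick up the remaining CPU-side work
mpirun -np 1 -x OMP_NUM_THREADS=8 vasp_std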

Another solution to the above would be to not use NCCL and instead use MPS to let the MPI ranks share a GPU.
That is probably still the best option for small calculations, where a single MPI rank has trouble saturating the GPU.
... at the moment however there is a part of the code that breaks without NCCL support (the hybrid functionals).
I'm working on changing that!
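
Should you want to experiment with MPS (with a build that does not use NCCL, as noted above), the service is started roughly like this (a sketch; paths and permissions are system dependent):

# start the CUDA Multi-Process Service daemon ...
nvidia-cuda-mps-control -d
# ... then several MPI ranks can share the single GPU
mpirun -np 4 vasp_std
# stop the daemon again when done
echo quit | nvidia-cuda-mps-control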

Cheers,
Martijn Marsman

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: VASP 6.2

#4 Post by david_keller » Wed Feb 03, 2021 1:57 pm

Thanks Martijn,

I will keep my eye out for a wiki update on the OpenACC+OpenMP build.

Another question, this time about the 6.2 documentation that explains usage.

Your doc says:
The execution statement depends heavily on your system! Our reference system consists of compute nodes with 4 cores per CPU. The example job script given here is for a job occupying a total of 64 cores, so 16 physical nodes.

On our clusters
1 openMPI process per node
#$ -N test
#$ -q narwal.q
#$ -pe orte* 64

mpirun -bynode -np 8 -x OMP_NUM_THREADS=8 vasp
The MPI option -bynode ensures that the VASP processes are started in a round robin fashion, so each of the physical nodes gets 1 running VASP process. If we miss out this option, on each of the first 4 physical nodes 4 VASP processes would be started, leaving the remaining 12 nodes unoccupied.
It would seem that if you are using 16 nodes with 64 total processors, then the mpirun should have either '-np 64' if it refers to all processes, or '-np 4' if it refers to processes per node? It would also seem that you would want more than one thread per process to keep the pipelines full?

I am confused...

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: VASP 6.2

#5 Post by david_keller » Wed Feb 03, 2021 4:41 pm

Hi Martijn,

Multiple CPU ranks sharing a GPU would most likely be good. I am trying to do some profiling to see if there is GPU idle time that could be soaked up in this way.
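
One simple way to watch for idle time while a job runs (a sketch, not VASP-specific; it assumes the NVIDIA driver utilities are installed):

# print per-second GPU utilization (sm %) and memory usage while VASP runs
nvidia-smi dmon -s um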

I succeeded in getting an OpenACC + OpenMP version to compile. My include file follows below. I had to build an OpenMP-enabled version of FFTW3 first (see the sketch below).

Unfortunately, it seems to slow down the throughput of a 1 GPU / 1 CPU run when OMP_NUM_THREADS > 1: the run is about 15% slower with OMP_NUM_THREADS=2 and about 30% slower with OMP_NUM_THREADS=4.
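
For reference, an OpenMP-enabled FFTW can be built roughly like this (a sketch; the compilers and install prefix here are placeholders, adjust to your environment):

# build FFTW 3.3.9 with its OpenMP wrapper library (libfftw3_omp)
cd fftw-3.3.9
./configure CC=nvc F77=nvfortran --enable-openmp --prefix=$HOME/fftw-3.3.9_nv
make -j
make install

The makefile.include itself:
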
#module load nvidia-hpc-sdk/20.11
#module load anaconda3/2018.12/b2


# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxPGI\" \
             -DMPI -DMPI_BLOCK=8000 -DMPI_INPLACE -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=4000 \
             -Davoidalloc \
             -Dvasp6 \
             -Duse_bse_te \
             -Dtbdyn \
             -Dqd_emulate \
             -Dfock_dblbuf \
             -D_OPENACC \
             -DUSENCCL -DUSENCCLP2P -D_OPENMP

CPP = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)

FC = mpif90 -acc -gpu=cc70,cc80,cuda11.1 -mp
FCL = mpif90 -acc -gpu=cc70,cc80,cuda11.1 -c++libs

FREE = -Mfree

FFLAGS = -Mbackslash -Mlarge_arrays

OFLAG = -fast

DEBUG = -Mfree -O0 -traceback

# Specify your NV HPC-SDK installation, try to set NVROOT automatically
NVROOT =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
# ...or set NVROOT manually
#NVHPC ?= /opt/nvidia/hpc_sdk
#NVVERSION = 20.9
#NVROOT = $(NVHPC)/Linux_x86_64/$(NVVERSION)

# Use NV HPC-SDK provided BLAS and LAPACK libraries
BLAS = -lblas
LAPACK = -llapack

BLACS =
SCALAPACK = -Mscalapack

CUDA = -cudalib=cublas,cusolver,cufft,nccl -cuda

LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) $(CUDA)

# Software emulation of quadruple precision
QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L$(QD)/lib -lqdmod -lqd
INCS += -I$(QD)/include/qd

# Use the FFTs from fftw
#FFTW ?= /opt/gnu/fftw-3.3.6-pl2-GNU-5.4.0
#FFTW = /cm/shared/software/fftw3/3.3.8/b6
FFTW = /g1/ssd/kellerd/vasp_gpu_work/fftw-3.3.9_nv
#LLIBS += -L$(FFTW)/lib -lfftw3
LLIBS += -L$(FFTW)/.libs -L$(FFTW)/threads/.libs -lfftw3 -lfftw3_omp
#INCS += -I$(FFTW)/include
INCS += -I$(FFTW)/mpi -I$(FFTW)/api

OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o
SOURCE_O2 := pead.o

# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = nvfortran
CC_LIB = nvc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS = nvc++ --no_warnings

# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: Using VASP 6.2 OpenACC GPU port

#6 Post by david_keller » Thu Feb 04, 2021 1:33 pm

Building with the nvidia-hpc-sdk, there is an unsatisfied reference to libatomic.so.
Where can this be found?

mmarsman
Global Moderator
Posts: 12
Joined: Wed Nov 06, 2019 8:44 am

Re: Using VASP 6.2 OpenACC GPU port

#7 Post by mmarsman » Wed Feb 10, 2021 6:58 pm

Hi David,

So I've finally found some time to work on an OpenACC wiki-page (not finished yet but getting there):

wiki/index.php/OpenACC_GPU_port_of_VASP

I've put in a link to an OpenACC+OpenMP makefile.include as well.
For the latter case I link against Intel's MKL library for CPU-side FFT, BLAS, LAPACK, and scaLAPACK calls.
Unsurprisingly, this is unbeatable on Intel CPUs, especially where threaded FFTs are concerned.
I do not know what CPU you have in your system, but if it's an Intel then use MKL.
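
For reference, the MKL block in such a makefile.include looks something like the sketch below (the library names follow Intel's standard LP64 naming, but treat the exact selection as an assumption and check Intel's link-line advisor and the wiki template for the definitive form):

# hypothetical MKL block: BLAS/LAPACK/FFTW3 interface plus ScaLAPACK with Open MPI BLACS
MKLROOT ?= /opt/intel/mkl
LLIBS   += -L$(MKLROOT)/lib/intel64 \
           -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
           -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
INCS    += -I$(MKLROOT)/include/fftw
# when MKL provides ScaLAPACK and the FFTs, -Mscalapack and the separate fftw3 libraries are not needed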

Unfortunately I noticed our description of the use of OpenMP (how to place the MPI-ranks and OpenMP threads etc) is outdated and crappy, so I will have to find time to work on that next.

With respect to your question on the unsatisfied "libatomic.so" reference in your build with the NVIDIA HPC-SDK: I have no idea, having never encountered this problem myself. But maybe this issue was solved along the way ... I have lost my way in this forum thread a bit, I fear :)

Regarding the performance of VASP on GPUs: putting work onto accelerators involves some overhead in the form of data transfers back and forth and launching of kernels. In practice this means that for small jobs you will probably see that the GPUs may not perform as well as you might be hoping (compared to CPU runs).
Correspondingly GPU idle time will be high for small jobs. The CPU will not be able to parcel out the work to the GPU fast enough to saturate it.
This is not surprising considering the enormous amount of flops these cards represent. You may see it as trying to run a small job on too large a number of CPU cores: if there's not enough work to parallelise over, performance drops at some point.

Another thing: for now please use the NVIDIA HPC-SDK 20.9! According to our contacts at NVIDIA version 20.11 has certain performance issues (with particular constructs in VASP), and in version 21.1 a bug was introduced that may even lead to wrong results.
I was assured these issues will all be fixed in the next release of the NVIDIA HPC-SDK (v21.2).
(I will put this in the wiki as well.)

Cheers,
Martijn

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: Using VASP 6.2 OpenACC GPU port

#8 Post by david_keller » Thu Feb 18, 2021 3:17 pm

Thanks Martijn!

We compiled and linked using SDK 20.9 with the Intel MKL FFT routines, rather than SDK 20.11 with fftw-3.3.9 (built using the NVIDIA SDK).
Our test-run elapsed times changed dramatically between an OpenACC run with 1 GPU and a run with 40 CPUs alone:

                  20.11 (FFTW)             20.9 + MKL
                  Elapsed    Max dist      Elapsed    Max dist
1 GPU / 1 CPU       486      0.48e-2         348      0.70e-2
40 CPU              184      0.46e-2         338      0.55e-2

So the elapsed time was slower for the CPU run using 20.9+MKL, but the GPU run became faster.

I do not know how significant it is, but the only result that differed in more significant digits was the 'maximum distance moved'. Random seeds differ, etc., so I presume one should not expect identical results, but I do not know how important the maximum-distance-moved output is?

This is after NMD=10.

BTW - we are still trying to reproduce some results we have seen where a single GPU runs twice as fast as 40 CPUs.

Dave Keller

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: Using VASP 6.2 OpenACC GPU port

#9 Post by david_keller » Tue Feb 23, 2021 2:44 pm

Hi Martijn,

Would you please clarify a couple of things for me?

Is it true that with 6.2 only a single executable needs to be compiled, which will use OpenMP if OMP_NUM_THREADS > 1 and will be GPU-capable if a GPU is available on the node?

If so, should you want to NOT use GPUs on a node that has them, is there a way to do so? Our experience is that the code will automatically use one if available.

Thanks for your help,

Dave Keller
LLE HPC

david_keller
Newbie
Posts: 21
Joined: Tue Jan 12, 2021 3:17 pm

Re: Using VASP 6.2 OpenACC GPU port

#10 Post by david_keller » Mon Mar 08, 2021 1:44 pm

I found out that setting CUDA_VISIBLE_DEVICES="" (null) will cause the OpenACC version to NOT look for GPUs.
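
For example (a sketch, assuming a bash-like shell and whatever launcher you normally use):

# hide all GPUs from the OpenACC runtime, then run the CPU-only code path
export CUDA_VISIBLE_DEVICES=""
mpirun -np 40 vasp_std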

mmarsman
Global Moderator
Posts: 12
Joined: Wed Nov 06, 2019 8:44 am

Re: Using VASP 6.2 OpenACC GPU port

#11 Post by mmarsman » Tue Jun 15, 2021 9:40 am

Hi David,

Sorry for the delay! In answer to your last questions:

Yes, when you build with OpenACC and OpenMP you will end up with an executable that will use OpenMP when OMP_NUM_THREADS > 1 and will be GPU capable if available on the node.
In principle you can then forbid the use of GPUs by making them "invisible" in the manner you describe.

However, there are several instances in the code where OpenMP threading is inactive as soon as you specify -D_OPENACC (i.e., compile with OpenACC support).
So you will lose OpenMP related performance as soon as you compile with OpenACC support.
The most severe example is the use of real-space projection operators: in the "normal" OpenMP version of the code, the work related to these real-space projectors is distributed over OpenMP threads, but as soon as you request OpenACC support (at compile time) this is no longer the case.

The idea of the OpenACC + OpenMP version is that OpenMP adds some additional CPU performance in those parts that have not been ported to the GPU (yet).

If you want the optimal OpenMP + MPI performance you should compile a dedicated executable *without* OpenACC support.
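
In terms of the makefile.include posted earlier in this thread, that amounts to stripping the GPU-related pieces (a sketch, not an exhaustive list):

# remove from CPP_OPTIONS:  -D_OPENACC -DUSENCCL -DUSENCCLP2P
# remove from FC and FCL:   -acc -gpu=cc70,cc80,cuda11.1
# remove the CUDA line:     CUDA = -cudalib=cublas,cusolver,cufft,nccl -cuda
# keep -mp so the executable remains OpenMP-threaded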

Cheers,
Martijn
