Parallelization tags in hybrid MPI/OpenMP 6.4
- Newbie
- Posts: 5
- Joined: Tue Sep 13, 2022 3:12 pm
Parallelization tags in hybrid MPI/OpenMP 6.4
I've been trying to find an optimal set of VASP parallelization tags when using the hybrid MPI/OpenMP version of VASP 6.4. However, I'm not sure I really understand how parameters like NCORE, NPAR, and KPAR work in this scenario. When every core runs its own MPI task it makes sense to me, but when there are multiple OpenMP threads per MPI task I see CPU times that I don't really understand. For example, with the MPI/OpenMP setup shown here:
#$ -pe mpi_32_tasks_per_node 256
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_STACKSIZE=1G
mpirun --map-by ppr:8:socket:PE=2 ${vasp_ver} > out
So there are 256 cores in total, running on 8 nodes, each with 2 sockets and 32 cores per node, which gives 8 MPI tasks per socket and 2 OpenMP threads per MPI task. Now I have several options for KPAR and NPAR. I've attached my VASP input files for reference, though they shouldn't matter much since they are the same for every combination of KPAR and NPAR. This is what I don't understand about the VASP parallelization tags; see the 4 scenarios listed below:
KPAR = 4, NPAR = 8: Time = 58 sec
KPAR = 8, NPAR = 16: Time = 109 sec
KPAR = 8, NPAR = 8: Time = 120 sec
KPAR = 2, NPAR = 8: Time = 105 sec
What is it about KPAR = 4, NPAR = 8 that makes it roughly 2x more efficient than the other options, given the MPI/OpenMP setup shown above? My understanding for the purely MPI version was that KPAR should be roughly the number of nodes used and NPAR roughly the square root of the total number of cores, but that doesn't seem to hold for the hybrid MPI/OpenMP build. Any help in understanding why these timings depend on KPAR and NPAR the way they do would be greatly appreciated.
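(For reference, here is the rule-of-thumb arithmetic I had in mind, applied to this job; just a sketch using the node and core counts from the job script above, and it matches the KPAR = 8, NPAR = 16 run:)
# rule-of-thumb from the pure-MPI case, using the geometry of this job
nodes=8
total_cores=256
echo "KPAR ~ number of nodes   = $nodes"
echo "NPAR ~ sqrt(total cores) = $(echo "sqrt($total_cores)" | bc)"   # 16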
- Global Moderator
- Posts: 501
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
Perhaps you could share the OUTCAR files of the 4 different runs.
I tried running your input files and I find that the number of k-points in the IBZKPT file is 4 which would mean that any KPAR>4 would be inefficient (half of your ranks will have no work).
Also, NCORE and NPAR control the same behavior but I personally find NCORE easier to understand.
NCORE determines how many MPI ranks share work on the same orbital using FFT parallelism (distribution of G vectors).
So your list of runs can also be read as:
KPAR = 4, NCORE = 4, Time = 58 sec
KPAR = 8, NCORE = 1, Time = 109 sec
KPAR = 8, NCORE = 2, Time = 120 sec
KPAR = 2, NCORE = 8, Time = 105 sec
(please confirm in your OUTCARs if this is correct)
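(In case it helps, the conversion above is just total MPI ranks = KPAR x NPAR x NCORE; a quick sketch of the arithmetic, assuming the 128 ranks implied by your job script, i.e. 8 tasks/socket x 2 sockets x 8 nodes:)
# NCORE = (total MPI ranks) / (KPAR * NPAR), assuming 128 ranks as in the job script above
ranks=128
for pair in "4 8" "8 16" "8 8" "2 8"; do
  set -- $pair
  kpar=$1; npar=$2
  echo "KPAR=$kpar NPAR=$npar -> NCORE=$(( ranks / (kpar * npar) ))"
done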
FFT parallelism implies quite some communication between the ranks that share work on the same orbital, so these ranks should be 'close together' (within the same socket if possible).
Now, OpenMP parallelism is also mostly done at the level of the G vectors, but there has to be enough work to be worth splitting among threads.
What you might be observing in the run on the last line is that, because you already distribute a lot of the work with NCORE=8, the overhead of creating the OpenMP threads is not worth the speedup in computation.
[EDIT: when using OpenMP, NCORE is internally set to 1]
But I would rather see your OUTCARs before making a definitive statement.
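(If it is useful, something like the following should pull the relevant numbers out of each run; this assumes the usual 'distr:' line in the OUTCAR and the standard KPOINTS-style layout of IBZKPT, with the k-point count on its second line:)
grep "one band on NCORE" OUTCAR   # NCORE and number of band groups actually used
sed -n 2p IBZKPT                  # number of k-points in the irreducible Brillouin zone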
- Newbie
- Posts: 5
- Joined: Tue Sep 13, 2022 3:12 pm
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
Thanks for your help. I've attached the 4 OUTCARs; I didn't save them before, so these are re-runs. Some of the timings have changed slightly, but they are qualitatively the same.
- Global Moderator
- Posts: 501
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
In what I wrote above I forgot to mention one very important detail:
when you use more than one OpenMP thread, NCORE is internally set to 1.
https://www.vasp.at/wiki/index.php/Comb ... and_OpenMP
You can see that this is indeed the case in the OUTCAR files.
You also have 4 k-points in the IBZ, which means that the runs with KPAR > 4 will be inefficient.
Now we just need to understand why the timings of these two runs are different:
KPAR = 4, NPAR = 8: Time = 58 sec
KPAR = 2, NPAR = 8: Time = 105 sec
First, note that the NPAR you put in the INCAR file is not being respected, because NCORE=1.
The actual values being used are:
KPAR = 4, NPAR = 32: Time = 58 sec
KPAR = 2, NPAR = 64: Time = 105 sec
You can find this at the top of the respective OUTCAR files:
distr: one band on NCORE= 1 cores, 32 groups
distr: one band on NCORE= 1 cores, 64 groups
Now you need to know that more communication is needed between the ranks that are dealing with the bands in one k-point than between the groups of ranks that are dealing with different k-points.
In the first case, one k-point is treated by 32 ranks (2 nodes), while in the second case it is treated by 64 ranks (4 nodes), and that is very likely the reason you observe a slowdown in the run with KPAR = 2.
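(A small sketch of that bookkeeping, assuming the 128 MPI ranks and 16 ranks per node from your job script:)
# ranks working on one k-point = (total ranks) / KPAR; with NCORE=1 this equals NPAR
ranks=128
ranks_per_node=16
for kpar in 4 2; do
  per_kpt=$(( ranks / kpar ))
  echo "KPAR=$kpar -> $per_kpt ranks per k-point, spread over $(( per_kpt / ranks_per_node )) nodes"
done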
- Newbie
- Posts: 5
- Joined: Tue Sep 13, 2022 3:12 pm
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
Ah, so the relationship between KPAR, NPAR, and NCORE depends on the total number of MPI tasks, not the total number of cores, whereas in the purely MPI version the two would be the same. Is that correct? The reason that NPAR is set to 32 for the case of KPAR = 4 is that there are 128 MPI tasks in total, and 128/4 = 32 (and likewise KPAR = 2 yields NPAR = 64)? In that case, for this hybrid OpenMP/MPI version, I should really be optimizing KPAR to minimize the number of OpenMP threads that have to communicate between different sockets, which is effectively set by NPAR?
- Global Moderator
- Posts: 501
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
"Ah, so the relationship between KPAR, NPAR, and NCORE depends on the total number of MPI tasks, not the total number of cores, whereas in the purely MPI version the two would be the same. Is that correct?"
Yes.
"The reason that NPAR is set to 32 for the case of KPAR = 4 is that there are 128 MPI tasks in total, and 128/4 = 32 (and likewise KPAR = 2 yields NPAR = 64)?"
Yes.
"In that case, for this hybrid OpenMP/MPI version, I should really be optimizing KPAR to minimize the number of OpenMP threads that have to communicate between different sockets, which is effectively set by NPAR?"
NPAR and NCORE control the same setting. In the OpenMP case, NCORE is 1, so NPAR cannot be controlled from the INCAR file; put another way, setting it has no effect, because NPAR will always be equal to (MPI tasks)/KPAR.
So yes, the only 'free' parameter is KPAR.
And yes, you should try to avoid OpenMP threads communicating between sockets.
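(If you want to double-check the pinning, something like this should work; it assumes Open MPI, which the --map-by syntax in your script suggests:)
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# --report-bindings prints, for every rank, the cores it is bound to;
# each rank should report 2 cores located on a single socket
mpirun --report-bindings --map-by ppr:8:socket:PE=2 ${vasp_ver} > out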
- Newbie
- Posts: 5
- Joined: Tue Sep 13, 2022 3:12 pm
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
That makes sense. So, in the event that I have a single k-point (a big unit cell), is the optimal way to handle k-point parallelization to set KPAR = 1? In that case, what does KPAR accomplish? If there's only 1 k-point, then it seems like any value of KPAR that evenly divides the total number of MPI tasks would do the same thing, just in a different way, since there's only one piece of work to distribute. If that's true, then the efficiency of that run is controlled purely by how many MPI tasks you can throw at VASP before communication becomes the bottleneck. Is that correct?
- Global Moderator
- Posts: 501
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: Parallelization tags in hybrid MPI/OpenMP 6.4
If you are doing a gamma-only calculation then it does not make sense to use KPAR>1. If you set it, it will be ignored.
In a pure-MPI version of VASP, you run one MPI rank per CPU, and for a given number of CPUs you can tune NCORE to improve efficiency.
In an MPI and OpenMP hybrid version, for the same set of CPUs, the equivalent setting is to choose the number of openMP threads per MPI rank.
Note that you can use the MPI and OpenMP hybrid version of VASP in the same way as the pure-MPI version provided you set the number of openMP threads to 1.
Of course, more CPUs are better until communication becomes the bottleneck.
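(To make that concrete, the two modes for the same 256 cores might look like this; a sketch based on the job script earlier in the thread, with the mapping again assuming Open MPI and the output file names just placeholders:)
# hybrid: 128 ranks x 2 OpenMP threads each; NCORE is forced to 1, so only KPAR is tuned
export OMP_NUM_THREADS=2
mpirun --map-by ppr:8:socket:PE=2 ${vasp_ver} > out_hybrid

# pure-MPI-style: 256 ranks x 1 thread each; NCORE (or NPAR) can be tuned in the INCAR again
export OMP_NUM_THREADS=1
mpirun --map-by ppr:16:socket:PE=1 ${vasp_ver} > out_mpi_only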