VASP_GPU Performance Issue
Posted: Tue Oct 24, 2023 1:12 pm
by burakgurlek
Dear all,
I wanted to compare the performance of a GPU run against a CPU run. I tried a single SCF loop, but I have not seen any improvement with the GPU (3 times slower than the CPU run, which is calculated as NBANDS*KPOINTS/8/NCORE) despite NSIM=128 and 1 MPI rank per GPU. I am probably making a mistake somewhere and would be happy if you could help. The simulation files are attached.
Regards,
Burak
Re: VASP_GPU Performance Issue
Posted: Mon Nov 06, 2023 11:26 am
by marie-therese.huebsch
Hi,
Thank you for running performance tests and sharing them here!
It is true that, comparing time to solution for GPU vs. CPU, the CPU version has the upper hand by a factor of about 3 for this particular calculation.
OUTCAR_CPU: LOOP+: cpu time 670.8752: real time 672.0153
OUTCAR_GPU: LOOP+: cpu time 1941.4516: real time 1905.4101
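As a quick check of the factor quoted above, here is a minimal Python sketch that reads the "real time" field from the two LOOP+ lines and forms the ratio. The helper name real_time and the hard-coded strings are only for illustration; in practice one would read the lines from the OUTCAR files themselves.

```python
import re

# Timing lines copied from the two OUTCAR files quoted above.
LINES = {
    "CPU": "LOOP+: cpu time 670.8752: real time 672.0153",
    "GPU": "LOOP+: cpu time 1941.4516: real time 1905.4101",
}


def real_time(line: str) -> float:
    """Return the wall-clock ('real time') value in seconds from a timing line."""
    match = re.search(r"real time\s+([0-9.]+)", line)
    if match is None:
        raise ValueError(f"no 'real time' field in: {line!r}")
    return float(match.group(1))


# Ratio of GPU to CPU wall time for the full ionic step (LOOP+).
slowdown = real_time(LINES["GPU"]) / real_time(LINES["CPU"])
print(f"GPU/CPU wall-time ratio: {slowdown:.2f}")
```

The ratio comes out at roughly 2.8, which is the "factor of 3" mentioned above.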
There are a couple of aspects to note:
- Most of the time is lost when computing the VdW forces.
OUTCAR_CPU: FORVDW: cpu time 13.9918: real time 14.0008
OUTCAR_GPU: FORVDW: cpu time 959.5451: real time 951.2572
That is because this part of the code has not been ported to GPU. Thus, at this point, we cannot recommend running calculations that include VdW forces on GPU.
- The GPU run takes 2 more iteration steps to reach convergence.
1 OUTCAR_CPU: LOOP: cpu time 31.0964: real time 31.3084
2 OUTCAR_CPU: LOOP: cpu time 37.0171: real time 37.1010
3 OUTCAR_CPU: LOOP: cpu time 36.1667: real time 36.2138
4 OUTCAR_CPU: LOOP: cpu time 35.8931: real time 35.9381
5 OUTCAR_CPU: LOOP: cpu time 36.2547: real time 36.3063
6 OUTCAR_CPU: LOOP: cpu time 29.3839: real time 29.4306
7 OUTCAR_CPU: LOOP: cpu time 32.8986: real time 32.9410
8 OUTCAR_CPU: LOOP: cpu time 28.9229: real time 28.9653
9 OUTCAR_CPU: LOOP: cpu time 32.5930: real time 32.6356
10 OUTCAR_CPU: LOOP: cpu time 33.8198: real time 33.8665
11 OUTCAR_CPU: LOOP: cpu time 32.8474: real time 32.8873
12 OUTCAR_CPU: LOOP: cpu time 34.5196: real time 34.5708
13 OUTCAR_CPU: LOOP: cpu time 40.0197: real time 40.0728
14 OUTCAR_CPU: LOOP: cpu time 34.0037: real time 34.0559
15 OUTCAR_CPU: LOOP: cpu time 33.3978: real time 33.4419
16 OUTCAR_CPU: LOOP: cpu time 35.9634: real time 36.0134
17 OUTCAR_CPU: LOOP: cpu time 38.1054: real time 38.1565
18 OUTCAR_CPU: LOOP: cpu time 25.3977: real time 25.4350
19 OUTCAR_CPU: LOOP+: cpu time 670.8752: real time 672.0153
1 OUTCAR_GPU: LOOP: cpu time 40.1964: real time 39.7315
2 OUTCAR_GPU: LOOP: cpu time 46.5786: real time 46.2402
3 OUTCAR_GPU: LOOP: cpu time 48.1026: real time 47.7735
4 OUTCAR_GPU: LOOP: cpu time 54.6248: real time 54.3239
5 OUTCAR_GPU: LOOP: cpu time 57.1046: real time 55.8257
6 OUTCAR_GPU: LOOP: cpu time 38.0119: real time 39.7700
7 OUTCAR_GPU: LOOP: cpu time 42.5045: real time 41.0562
8 OUTCAR_GPU: LOOP: cpu time 37.5500: real time 36.0945
9 OUTCAR_GPU: LOOP: cpu time 42.1764: real time 40.6554
10 OUTCAR_GPU: LOOP: cpu time 44.2746: real time 42.7324
11 OUTCAR_GPU: LOOP: cpu time 42.5493: real time 41.0302
12 OUTCAR_GPU: LOOP: cpu time 46.4575: real time 44.9322
13 OUTCAR_GPU: LOOP: cpu time 50.7778: real time 49.2683
14 OUTCAR_GPU: LOOP: cpu time 43.5063: real time 41.9760
15 OUTCAR_GPU: LOOP: cpu time 42.6561: real time 41.1782
16 OUTCAR_GPU: LOOP: cpu time 45.6093: real time 44.1545
17 OUTCAR_GPU: LOOP: cpu time 48.8931: real time 47.4580
18 OUTCAR_GPU: LOOP: cpu time 33.3592: real time 31.8699
19 OUTCAR_GPU: LOOP: cpu time 35.0750: real time 33.5765
20 OUTCAR_GPU: LOOP: cpu time 29.2735: real time 28.8527
21 OUTCAR_GPU: LOOP+: cpu time 1941.4516: real time 1905.4101
This could just as well be the other way around. So, there is no fundamental conclusion we can draw from this observation.
- Time to solution vs. power per iteration step: I understand the interest in comparing time to solution, but alternatively, one can look at the time per iteration step to judge the performance. If we do that and subtract the contribution from the VdW forces, we still observe that the GPU run takes about 25% more time per iteration step. Additionally, you could consider the power consumption and the availability of resources. Depending on your hardware, the GPU may have the upper hand after all (for calculations without VdW forces) when considering power per iteration step.
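The numbers behind the three points above can be reproduced with a short Python sketch. It only uses the "real time" values from the LOOP, LOOP+ and FORVDW lines quoted in this post; the variable names are illustrative, and the values are typed in by hand rather than parsed from the OUTCAR files.

```python
from statistics import mean

# 'real time' values (seconds) of the LOOP lines quoted above, one per step.
cpu_steps = [31.3084, 37.1010, 36.2138, 35.9381, 36.3063, 29.4306,
             32.9410, 28.9653, 32.6356, 33.8665, 32.8873, 34.5708,
             40.0728, 34.0559, 33.4419, 36.0134, 38.1565, 25.4350]
gpu_steps = [39.7315, 46.2402, 47.7735, 54.3239, 55.8257, 39.7700,
             41.0562, 36.0945, 40.6554, 42.7324, 41.0302, 44.9322,
             49.2683, 41.9760, 41.1782, 44.1545, 47.4580, 31.8699,
             33.5765, 28.8527]

# LOOP+ and FORVDW 'real time' values (seconds) from the same OUTCAR files.
loop_plus = {"CPU": 672.0153, "GPU": 1905.4101}
forvdw = {"CPU": 14.0008, "GPU": 951.2572}

# How many more iteration steps the GPU run needed.
extra_steps = len(gpu_steps) - len(cpu_steps)

# Mean wall time per iteration step, GPU relative to CPU.
per_step_ratio = mean(gpu_steps) / mean(cpu_steps)

# Fraction of the GPU-vs-CPU gap in LOOP+ that is explained by FORVDW alone.
gap_total = loop_plus["GPU"] - loop_plus["CPU"]
gap_from_vdw = (forvdw["GPU"] - forvdw["CPU"]) / gap_total

print(f"extra GPU steps:            {extra_steps}")
print(f"GPU/CPU time per step:      {per_step_ratio:.2f}")
print(f"share of gap due to FORVDW: {gap_from_vdw:.0%}")
```

With these inputs the GPU run needs 2 more steps, takes about 1.25 times as long per step, and roughly three quarters of the total time difference comes from the VdW forces.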
I hope these comments are helpful.
Cheers,
Marie-Therese