Efficiency of Simultaneous Multithreading on a Cluster
-
- Newbie
- Posts: 14
- Joined: Mon Jul 17, 2023 4:42 pm
Efficiency of Simultaneous Multithreading on a Cluster
Hi,
I'm currently starting to use VASP on a cluster rather than on a workstation, and have a question concerning AMD's version of hyperthreading (simultaneous multithreading, SMT). On the wiki I found a warning stating that hyperthreading is not beneficial to VASP's performance. I don't quite understand whether this statement applies to my situation:
I'm running calculations with 448 bands. I found online that ideally the number of cores used is NBANDS/4, so in my case 112. VSC5 has 64-core nodes with AMD 7713 CPUs, which support 2 threads per core. Instinctively I'd use a single node and launch 112 ranks on hardware threads with "mpirun --use-hwthread-cpus -np 112 vasp_std".
Is this more or less efficient than just using the 64 physical cores without utilising SMT (running with -np 64)? I'll try benchmarking it myself over the next couple of days, but I'd like to understand what the warning in the wiki really means ^^
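Concretely, the two variants I'd be comparing (assuming OpenMPI on a single VSC5 node) would be something like:

    # one rank per hardware thread (SMT)
    mpirun --use-hwthread-cpus -np 112 vasp_std
    # one rank per physical core, no SMT
    mpirun -np 64 vasp_std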
And maybe since you're here: I also saw on the wiki that NCORE = 4 is good for systems with around 100 atoms. Should I also set KPAR to a value other than 1 in my situation? According to a post I found, KPAR should equal the number of nodes used, as long as that divides the number of k-points evenly. I'll work with one node at a time for now, so is the default KPAR ideal?
Cheers and thanks for the help,
Max
-
- Global Moderator
- Posts: 542
- Joined: Fri Nov 08, 2019 7:18 am
Re: Efficiency of Simultaneous Multithreading on a Cluster
HPC hardware is getting more and more diverse, and this makes it harder and harder to come up with good rules that work everywhere. There are obvious cases, like a GPU having different performance characteristics and therefore different recommendations than a CPU, but the same holds for other systems as well.
With this disclaimer out of the way, here are a couple of points regarding your question: We typically observe that hyperthreading is not more performant than using fewer ranks. It appears that the context-switching overhead is larger than the potential benefit. There may be other, less obvious things that contribute to this: when you run more ranks than you have cores, MPI needs to decide on a placement for these processes. If ranks that need to communicate compete for the same resources, you may suffer from unintended wait periods.
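If you want to check where your ranks actually end up, OpenMPI can report the binding it chose; something along these lines (these flags are OpenMPI-specific, other launchers have their own equivalents):

    # pin one rank per physical core and print the resulting placement
    mpirun --bind-to core --map-by core --report-bindings -np 64 vasp_std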
Regarding NCORE, we now recommend a value between 2 and the number of cores per node. The optimal choice in this range depends on the system you want to study. For KPAR, the tradeoff is between increased memory consumption and increased speed. If you can afford the memory, it should be one of the most efficient parallelization modes of VASP. Just keep in mind to choose a factor of the number of irreducible k-points.
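As a sketch only, for a single 64-core node this could translate into INCAR tags along these lines (NCORE = 4 and KPAR = 2 are just example values that need testing, see below):

    NCORE = 4    ! somewhere between 2 and the number of cores per node; tune per system
    KPAR = 2     ! must divide the number of MPI ranks; memory use scales with KPAR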
More generally, and this brings me back to the disclaimer: test your parallel setup on your specific machine. It is very common that you want to run multiple similar calculations, e.g. an ionic relaxation consisting of many electronic SCF steps. Run the first few of these while varying your parallelization parameters, and then run the production calculation with the optimal setup.
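A minimal sketch of such a scan (bash; the directory names, the NCORE values, and the fixed -np 64 are just placeholder choices):

    #!/bin/bash
    # Scan a few NCORE values on a short test run (e.g. cap NELM to a handful of SCF steps).
    # Assumes INCAR, KPOINTS, POSCAR, POTCAR sit in the current directory and INCAR
    # already contains an NCORE line that sed can replace.
    for ncore in 2 4 8 16; do
        mkdir -p ncore_$ncore
        cp INCAR KPOINTS POSCAR POTCAR ncore_$ncore/
        cd ncore_$ncore
        sed -i "s/^NCORE.*/NCORE = $ncore/" INCAR
        mpirun -np 64 vasp_std > stdout
        grep LOOP OUTCAR | tail -n 1    # timing of the last SCF loop, to compare across runs
        cd ..
    done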
Martin Schlipf
VASP developer
-
- Newbie
- Posts: 14
- Joined: Mon Jul 17, 2023 4:42 pm
Re: Efficiency of Simultaneous Multithreading on a Cluster
Thanks for the response, it's very helpful!
Regarding KPAR: The wiki states that KPAR should be an integer divisor of the number of cores, and according to you it has to be a factor of the number of irreducible k-points as well? Am I right in assuming that the trick is then to reduce the number of cores until you find as high a common factor as your memory allows?
-
- Global Moderator
- Posts: 542
- Joined: Fri Nov 08, 2019 7:18 am
Re: Efficiency of Simultaneous Multithreading on a Cluster
While KPAR being a factor of the number of cores is a strict requirement, KPAR being a factor of the number of irreducible k-points is only strong performance advice.
Say you have 39 irreducible k-points: if you limit yourself to KPAR = 3, 13, or 39, you may struggle to fit that onto a typical node, since KPAR has to divide the number of ranks and those values rarely divide typical core counts. If you use KPAR = 8, you lose a bit of performance, but it fits better on the nodes. So the more precise advice is to use a factor of the number of irreducible k-points, or a KPAR that divides a number only a little larger (here, 8 divides 40). In the example you would calculate k-points 1-8, 9-16, 17-24, 25-32, and in the last round one k-point group would idle while the rest deal with 33-39.
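A quick way to check this bookkeeping for candidate KPAR values (plain bash arithmetic; 39 and the candidate values are just the example from above):

    nk=39                                    # number of irreducible k-points
    for kpar in 3 4 8 13; do
        rounds=$(( (nk + kpar - 1) / kpar ))   # ceil(nk / kpar)
        idle=$(( rounds * kpar - nk ))         # k-point groups idling in the last round
        echo "KPAR=$kpar: $rounds rounds, $idle idle group(s) in the last round"
    done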
Regarding the memory, you can either compute it or look in the OUTCAR file. Typically, k-point parallelization is very efficient, so you push this until you max out on memory.
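For a quick look, something like this shows the memory-related lines VASP writes to OUTCAR (the exact wording of those lines may differ between versions):

    grep -i memory OUTCAR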
Martin Schlipf
VASP developer
-
- Newbie
- Posts: 14
- Joined: Mon Jul 17, 2023 4:42 pm
Re: Efficiency of Simultaneous Multithreading on a Cluster
Thanks again, you've been incredibly helpful! ^^