Best parallelism for 1st order response calculations.

Phonons, DFPT, electron-phonon, electric-field response, mechanical response…

Moderators: mverstra, joaocarloscabreu

jackbaker
Posts: 6
Joined: Fri Sep 25, 2020 3:33 am

Best parallelism for 1st order response calculations.

Post by jackbaker » Fri Sep 25, 2020 4:33 am

Hi everyone,

I am planning to do some parallel response-function calculations (for the phonon dispersions) on a large(ish) system of 80 atoms (352 bands), with a 5x3x2 k-point grid (30 k-points when kptopt = 3) and a 25 Ha cutoff, using PAW and the LDA.

The most time-consuming part of this calculation is the finite-q part, i.e.:

getwfk 1 # Use GS wave functions from dataset1
kptopt 3 # Need full k-point set for finite-Q response
rfphon 1 # Do phonon response
rfatpol 1 80 # Treat displacements of all atoms
rfdir 1 1 1 # Do all directions (symmetry will be used)
tolvrs 1.0d-12 # This default is active for sets 3-10

I want to work out how to efficiently distribute the load for this over k-points, the FFT grid and bands. The problem is that paral_kgb doesn't work here, and setting any value other than 1 for npfft or npband resets it to 1 when the calculation starts, i.e.:

--- !WARNING
src_file: m_mpi_setup.F90
src_line: 267
message: |
For non ground state calculation, set bandpp, npfft, npband, npspinor npkpt and nphf to 1
...

I have read that only k-point parallelisation works here; however, the Abinit website (https://docs.abinit.org/topics/parallelism/) reports otherwise, saying:

For response calculations, the code has been parallelized (MPI-based parallelism) on k-points, spins, bands, as well as on perturbations. For the k-points, spins and bands parallelisation, the communication load is rather small also, and, unlike for the GS calculations, the number of nodes that can be used in parallel will be large, nearly independently of the physics of the problem. Parallelism on perturbations is very similar to the parallelism on images in the ground state case (so, very efficient), although the load balancing problem for perturbations with different number of k points is not adressed at present. Use of MPIIO is mandatory for the largest speed ups to be observed.

I then have three questions:

1) How does parallelism work in a phonon calculation?

2) How do I best set the number of processors for such a calculation (according to the number of bands, k-points and the FFT grid)?

3) Does the hybrid MPI/OpenMP parallelisation help for RF calculations? I ask since this isn't mentioned on the website (from what I have found, at least).

Best,

Jack

ebousquet
Posts: 469
Joined: Tue Apr 19, 2011 11:13 am
Location: University of Liege, Belgium

Re: Best parallelism for 1st order response calculations.

Post by ebousquet » Sat Sep 26, 2020 4:34 pm

Dear Jack,
jackbaker wrote:
Fri Sep 25, 2020 4:33 am
1) How does parallelism work in a phonon calculation?
Parallelism of DFPT works on k-points and bands by default, without paral_kgb.
This means that you first spread the k-points over CPUs and, if you can use more, start to spread over bands. When I say "spread" I mean that you just have to choose the number of CPUs in your mpirun command and Abinit will handle the parallelism.

jackbaker wrote:
Fri Sep 25, 2020 4:33 am
2) How do I best set the number of processors for such a calculation (according to the number of bands, k-points and the FFT grid)?
For example, if you have 40 k-points, you can parallelize the calculation on k-points up to 40 CPUs. Then, for each k-point, you can parallelize over bands, meaning that you can split the band calculation over, e.g., 2 CPUs per k-point, which makes a job of 40x2 = 80 CPUs; the speedup should ideally be close to 2 times faster than on 40 CPUs with only k-points. And then you can go on: 40x4 = 160 CPUs, 40x6 = 240 CPUs, 40x8 = 320 CPUs. The speedup will not be ideal, depending on how well Abinit is compiled on your machine, how good the communication between the CPUs is, etc. Do a small speedup test if you want to know how good it is in your case and what the optimal number of CPUs is.
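
For illustration, such a speedup test could look like the sketch below; the input file name phonon.abi and the plain mpirun / "abinit phonon.abi" invocation are just assumptions, adapt them to your Abinit version and scheduler:

Code: Select all

# Hypothetical speedup test: same input, increasing number of MPI processes.
mpirun -n 40  abinit phonon.abi > log_040 2>&1   # 1 CPU per k-point
mpirun -n 80  abinit phonon.abi > log_080 2>&1   # 2 CPUs per k-point (bands split in 2)
mpirun -n 160 abinit phonon.abi > log_160 2>&1   # 4 CPUs per k-point
mpirun -n 320 abinit phonon.abi > log_320 2>&1   # 8 CPUs per k-point
# Compare the wall times reported at the end of each log to see where
# the band parallelism stops paying off.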
jackbaker wrote:
Fri Sep 25, 2020 4:33 am
3) Does the hybrid MPI/OpenMP parallelisation help for RF calculations? I ask since this isn't mentioned on the website (from what I have found, at least).
OpenMP is not yet available for DFPT; it is ongoing work!

Best wishes,
Eric

jackbaker
Posts: 6
Joined: Fri Sep 25, 2020 3:33 am

Re: Best parallelism for 1st order response calculations.

Post by jackbaker » Sat Sep 26, 2020 6:09 pm

Hi Eric,

Thanks a lot. This now makes a lot of sense. Just to clarify though:

1) For best performance, I just need to set Nproc = Nkpt*B, where B is an integer, and find the value of B with the most efficient speed-up (which will drop off at large B)?

2) Does this value of B in any way need to line up with (i.e., be a factor of) the total number of bands? I'm guessing not!

3) Since (I guess) a lot of the performance drop-off at large B comes from bottlenecks in MPI communication, would I be better off (in the absence of OpenMP threading) under-occupying nodes to avoid saturating the MPI channels?

Thanks again for your help!

Jack

gmatteo
Posts: 291
Joined: Sun Aug 16, 2009 5:40 pm

Re: Best parallelism for 1st order response calculations.

Post by gmatteo » Thu Oct 01, 2020 4:35 am

ebousquet wrote:
Parallelism of DFPT works on k-points and bands by default, without paral_kgb.
This means that you first spread the k-points over CPUs and, if you can use more, start to spread over bands. When I say "spread" I mean that you just have to choose the number of CPUs in your mpirun command and Abinit will handle the parallelism.
What Eric wrote is correct: paral_kgb is not supported in the DFPT code, and Abinit will automatically distribute the k-points and the bands. Note, however, that in the DFPT context the number of k-points is not the number of points in the IBZ used in the GS part. Each perturbation has its own irreducible wedge that is usually larger than the GS IBZ, since only the symmetries that preserve q and the direction of the perturbation can be exploited.

The number of points in the IBZ(q, idir, ipert) (let's call it nk_pertcase) is reported in this section of the main output file:

Code: Select all

 Perturbation wavevector (in red.coord.)   0.000000  0.000000  0.000000
 Perturbation : displacement of atom   1   along direction   1
 Found     2 symmetries that leave the perturbation invariant.
 symkpt : the number of k-points, thanks to the symmetries,
 is reduced to    72 .
This means that DFPT calculations are more expensive than GS ones, as you have more k-points, but on the other hand it also implies that one can use more MPI processes, as the code parallelizes the computation of the first-order wavefunctions over nk_pertcase * nband.
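
As a rough illustration with the numbers in this thread (72 k-points in this perturbation's wedge, 352 bands in Jack's system; how Abinit actually groups the processes is left to the code):

Code: Select all

# nk_pertcase = 72, nband = 352 (illustrative numbers only)
#   72 MPI procs -> 1 proc per k-point (pure k-point parallelism)
#  144 MPI procs -> 2 procs per k-point (bands split in 2 groups)
#  288 MPI procs -> 4 procs per k-point
# Formal upper bound: nk_pertcase * nband = 72 * 352 = 25344 processes,
# but communication costs make much smaller counts the practical limit.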

There are two points worth considering:

1) If you run all the perturbations in a single input file it's almost impossible to find an optimal number of MPI procs, as each perturbation will have its own irreducible wedge. In principle, one can use the parallelism over the perturbations (https://docs.abinit.org/variables/paral/#paral_rf). This technique is handy as everything can be done with a single input file, but I'm not a big fan of this approach as different perturbations may require a different number of iterations to converge, so you will get load imbalance. Last but not least, some perturbations may not converge; in this case, the code will stop and you won't get any result.
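
Just as an illustration, activating that perturbation-level parallelism in the input could look like the snippet below; the value given to nppert is hypothetical, so check the paral_rf and nppert documentation for the exact meaning:

Code: Select all

paral_rf 1     # activate the parallelization over perturbations
nppert   4     # hypothetical value: processors at the perturbation level
# Each group of processes still parallelizes over the k-points and bands
# of its own perturbation.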

2) Not all the data structures in the DFPT code are distributed in memory, so memory does not scale with the number of MPI processes. At a certain point one hits this bottleneck and can no longer run with one MPI process per core of the compute node; in this case, one should consider OpenMP threads (see my comments below).
jackbaker wrote:
3) Does the hybrid MPI/OpenMP parallelisation help for RF calculations? I ask since this isn't mentioned on the website (from what I have found, at least).
OpenMP may help mitigate the MPI bottleneck. The DFPT code is not optimized for OpenMP, in the sense that most of the high-level loops are parallelized with MPI; still, one can use OpenMP at the level of the FFTs, BLAS/LAPACK and the non-local part.
Obviously one should not expect the same scalability as with MPI, but 2-4 threads may be beneficial if you are dealing with large systems, as this hybrid MPI-OpenMP approach allows one to use all the CPUs on the nodes.
If I remember correctly, one of the bottlenecks of DFPT is the routine that orthogonalizes the trial first-order wavefunction with respect to the nband GS states. This step is performed with BLAS2 routines and can benefit from OpenMP threads (provided one uses a threaded BLAS library).
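
A hybrid launch might then look like the sketch below; this assumes an OpenMP-enabled build linked against threaded MKL, the --map-by syntax is Open MPI specific, and the input file name is again just an example:

Code: Select all

# Sketch: 72 MPI processes, 4 OpenMP threads each (8 procs per node,
# 4 cores per proc). Adjust to your node size and MPI launcher.
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
mpirun -n 72 --map-by ppr:8:node:pe=4 abinit phonon.abi > log_hybrid 2>&1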
We recently added a new tutorial that explains how to activate support for OpenMP with Intel and MKL (https://docs.abinit.org/tutorial/compil ... nd-modules).
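
For reference, a configure line for such a build could look roughly like the following; the exact option names depend on your Abinit version and compilers, so treat this as a sketch and follow the tutorial above:

Code: Select all

# Hypothetical configure invocation for an OpenMP + threaded-MKL build;
# check ./configure --help for the options available in your version.
../configure FC=mpiifort CC=mpiicc \
             --enable-openmp \
             --with-linalg-flavor="mkl"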

Hope it helps.
