Catastrophically heterogeneous memory consumption in large GS run

Post by jbde » Fri Aug 14, 2020 5:41 pm

I'm experiencing what feels like strange behaviour with a GS run, and before I file a bug report, I thought I would check here to see if the problem is just me doing something dumb. ;)

I've attached a sample set of files (.in, .out, and .log) for the GS run I'm trying to perform. The pseudopotentials are standard JTH PAW ones -- I can't attach them to this post, for some reason. The planewave cutoff is high, and as I increase the size of the k-point mesh, the calculation gets far too large for a single process or even a single node, so it makes sense to distribute everything across a moderate number of MPI processes.

However, as I increase the number of MPI processes, I notice that past a certain count the per-process memory load becomes HIGHLY and increasingly heterogeneous. For the case I've attached (96 MPI processes), the preamble of the log claims that each process should use about 2400 MB of memory. In my experience these estimates are sometimes hilariously wrong, but in this particular case MOST of the processes do show memory pressure at around that figure. However, a single one of the higher-ranked processes -- not the master -- consistently consumes 30.2 GB(!).
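
For anyone who wants to reproduce the per-process numbers independently of the log's estimate, a node-local check along these lines is enough (a minimal sketch, assuming Linux's /proc and a binary whose name contains "abinit"; the abinit_rss_mb helper is just for illustration and isn't part of the attached files):

Code:

    import os
    import re

    def abinit_rss_mb():
        """Return {pid: resident-set size in MB} for every process named like 'abinit'."""
        out = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    name = f.read().strip()
                if "abinit" not in name:
                    continue
                with open(f"/proc/{pid}/status") as f:
                    match = re.search(r"VmRSS:\s*(\d+)\s*kB", f.read())
                if match:
                    out[int(pid)] = int(match.group(1)) / 1024.0
            except OSError:
                continue  # process exited (or is unreadable) between listing and reading
        return out

    if __name__ == "__main__":
        for pid, mb in sorted(abinit_rss_mb().items()):
            print(f"pid {pid}: {mb:8.1f} MB resident")

Run on each node while the job is active, this reports the resident memory of every ABINIT rank hosted there.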

The nodes I'm working with have 32 cores and 192 GB RAM each, so I can work around this up to a point by undersubscribing nodes. (Although that's not exactly an efficient use of resources!) However, there comes a point (around a 30 x 30 x 30 k-point mesh, from memory) where this single aberrant process gets so large that it no longer fits on a node, so the job becomes impossible to run, regardless of the overall number of processes or the total distributed memory I throw at it.

This is all being done with KGB parallelisation and "autoparal 1". It looks like the autoparal heuristics simply allocate all available processors to k-points, which makes sense, so my first thought was that perhaps the FFT grid (or the fine FFT grid for PAW) was sitting on a single process and consuming all that memory. But that doesn't add up -- the log reports the total number of fine-grid points as 2278125, and there's no way to multiply that by 8-byte doubles and get anywhere near 30 GB.
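
To spell out that arithmetic (a quick back-of-envelope sketch in Python; 2278125 is the fine-grid count from the attached log, and the array counts are purely illustrative):

Code:

    nfft_fine = 2278125        # fine (PAW) FFT grid points reported in the log
    bytes_per_real = 8         # double precision

    one_array_mb = nfft_fine * bytes_per_real / 1024**2
    print(f"one real-valued fine-grid array: {one_array_mb:.1f} MB")   # ~17.4 MB

    # even an implausibly large number of such arrays falls short of 30 GB
    for ncopies in (10, 100, 1000):
        print(f"{ncopies:4d} arrays: {ncopies * one_array_mb / 1024:.1f} GB")

Even a thousand full fine-grid arrays only comes to roughly 17 GB, so the fine grid alone doesn't seem able to account for a 30 GB process.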

I've tried this with ABINIT versions 8.10.3 and 9.0.4, compiled with various combinations of Intel and GCC compilers, using Intel MPI and OpenMPI, at optimisation levels from "safe" up to "aggressive", and it happens consistently. The only constant in the configuration, apart from the ABINIT code itself, is the use of MKL for both linear algebra and FFTs, so perhaps that's a factor...?

Please note that I'm not really looking for a critique of whether the job settings (E_cut, number of k-points, etc.) are sensible -- although I'd be interested in any comments! :) Mostly, I just want to know whether I'm making an obvious mistake with the parallelisation settings...
Attachments
bug.out
(53.79 KiB) Downloaded 187 times
bug.log
(67.98 KiB) Downloaded 189 times
bug.in
(570 Bytes) Downloaded 200 times
