Test the performance of abinit8 SCF too slow

Documentation, Web site and code modifications

Dominic
Posts: 18
Joined: Mon Jan 21, 2013 4:34 pm

Post by Dominic » Sun Feb 26, 2017 6:16 pm

I compiled abinit8 with ATLAS and MPI and ran my system of 26 atoms and 100 bands on 16 threads; the calculation took about 10 min to 1 hr per iteration. Out of frustration, I ran the same calculation in Quantum Espresso (compiled with the same libraries on the same PC), this time with 300 bands; it took 3 s to 30 s per iteration.

Summary

Abinit: 100 bands 10 mins - 1 hr calculation time/iteration
Quantum Espresso: 300 bands 3s - 30s calculation time/iteration


Conclusion

SCF calculations in Abinit should be improved. In my previous post, I profiled Abinit and found opernlb to be a possible culprit. However, given the performance of QE, something else might be influencing the performance.

gmatteo
Posts: 291
Joined: Sun Aug 16, 2009 5:40 pm

Re: Test the performance of abinit8 SCF too slow

Post by gmatteo » Sun Feb 26, 2017 6:23 pm

my system consisting of 26 atoms and 100 bands on 16 thread


    * OpenMP threads or MPI processes?
    * Could you post the input file?
    * configure options and output of `abinit -b`

Dominic
Posts: 18
Joined: Mon Jan 21, 2013 4:34 pm

Re: Test the performance of abinit8 SCF too slow

Post by Dominic » Sun Feb 26, 2017 6:45 pm

Hi, I only use MPI parallelism; there is no threading in the linear algebra. Below is my input file for a geometry optimization. It involves many iterations: first the iterations for energy convergence, then the iterations over the structure. I can check the LOG file and see clearly that the iterations for energy convergence are very slow.

The pseudopotentials come from the abinit source package: Cl.GGA-PBE-paw.abinit and Al.GGA-PBE-paw.abinit

mysystem.in

ionmov  2              # Use the modified Broyden algorithm
occopt 6
optcell 2
ecutsm 0.5
pawecutdg 18
ntime   100             
tolmxf  5.0d-2
toldff  5.0d-3

nband    100

#Definition of the unit cell
acell 7.6495800018 38.0000000000 16.0000000000 angstrom

#Definition of the atom types
ntypat 2
znucl 17 13

#Definition of the atoms
natom 26           # There are 26 atoms
typat 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

#Definition of the planewave basis set
ecut 14         # Maximal kinetic energy cut-off, in Hartree

#Definition of the k-point grid
ngkpt 6 1 1 #K points similar to VASP
nshiftk 1 #Definition of K-point generator
shiftk   0.0 0.0 0.0

#Definition of the SCF procedure
nstep 100
diemac 100

xred       0.223574534         0.158213347         0.800000012
     0.776425481         0.158213347         0.800000012
     0.776425481         0.841786623         0.800000012
     0.223574534         0.841786623         0.800000012
     0.332109123         0.209108904         0.800000012
     0.667890906         0.209108904         0.800000012
     0.168449000         0.267287105         0.800000012
     0.831551015         0.267287105         0.800000012
     0.332109123         0.325446516         0.800000012
     0.667890906         0.325446516         0.800000012
     0.168449000         0.383643568         0.800000012
     0.831551015         0.383643568         0.800000012
     0.332109123         0.441821784         0.800000012
     0.667890906         0.441821784         0.800000012
     0.168449000         0.500000000         0.800000012
     0.831551015         0.500000000         0.800000012
     0.332109123         0.558178186         0.800000012
     0.667890906         0.558178186         0.800000012
     0.168449000         0.616356432         0.800000012
     0.831551015         0.616356432         0.800000012
     0.332109123         0.674553514         0.800000012
     0.667890906         0.674553514         0.800000012
     0.168449000         0.732712865         0.800000012
     0.831551015         0.732712865         0.800000012
     0.332109123         0.790891111         0.800000012
     0.667890906         0.790891111         0.800000012


Abinit Configure options:

 === Build Information === 
  Version       : 8.0.8
  Build target  : x86_64_linux_gnu4.7
  Build date    : 20170224

 === Compiler Suite ===
  C compiler       : gnu4.7
  C++ compiler     : gnu4.7
  Fortran compiler : gnu4.7
  CFLAGS           :   -O3 -mtune=native -march=native  -fPIC
  CXXFLAGS         :   -O3 -mtune=native -march=native  -fPIC
  FCFLAGS          :   -ffree-line-length-none -fPIC
  FC_LDFLAGS       :     -Wl,-z,muldefs

 === Optimizations ===
  Debug level        : no
  Optimization level : aggressive
  Architecture       : intel_xeon

 === Multicore ===
  Parallel build : yes
  Parallel I/O   : yes
  openMP support : no
  GPU support    : no

 === Connectors / Fallbacks ===
  Connectors on : yes
  Fallbacks on  : yes
  DFT flavor    : libxc-fallback+atompaw-fallback+bigdft-fallback+wannier90-fallback
  FFT flavor    : none
  LINALG flavor : atlas
  MATH flavor   : none
  TIMER flavor  : abinit
  TRIO flavor   : netcdf-fallback+etsf_io-fallback

 === Experimental features ===
  Bindings            : @enable_bindings@
  Exports             : yes
  GW double-precision : yes

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Default optimizations:
   -O3 -mtune=native -march=native -faggressive-function-elimination -fstack-arrays


 Optimizations for 20_datashare:
   -O0


 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 CPP options activated during the build:

                    CC_GNU                   CXX_GNU                    FC_GNU
 
          HAVE_DFT_ATOMPAW           HAVE_DFT_BIGDFT            HAVE_DFT_LIBXC
 
        HAVE_DFT_WANNIER90 HAVE_FC_ALLOCATABLE_DT...             HAVE_FC_ASYNC
 
  HAVE_FC_COMMAND_ARGUMENT      HAVE_FC_COMMAND_LINE        HAVE_FC_CONTIGUOUS
 
           HAVE_FC_CPUTIME              HAVE_FC_EXIT             HAVE_FC_FLUSH
 
             HAVE_FC_GAMMA            HAVE_FC_GETENV          HAVE_FC_INT_QUAD
 
             HAVE_FC_IOMSG     HAVE_FC_ISO_C_BINDING  HAVE_FC_ISO_FORTRAN_2008
 
        HAVE_FC_LONG_LINES        HAVE_FC_MOVE_ALLOC           HAVE_FC_PRIVATE
 
         HAVE_FC_PROTECTED         HAVE_FC_STREAM_IO            HAVE_FC_SYSTEM
 
          HAVE_FORTRAN2003               HAVE_GW_DPC        HAVE_LIBPAW_ABINIT
 
      HAVE_LIBTETRA_ABINIT               HAVE_LINALG        HAVE_LINALG_SERIAL
 
                  HAVE_MPI                 HAVE_MPI2        HAVE_MPI_INTEGER16
 
               HAVE_MPI_IO HAVE_MPI_TYPE_CREATE_S...             HAVE_OS_LINUX
 
                HAVE_TIMER         HAVE_TIMER_ABINIT            HAVE_TIMER_MPI
 
         HAVE_TIMER_SERIAL         HAVE_TRIO_ETSF_IO          HAVE_TRIO_NETCDF
 
              USE_MACROAVE                                                     
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


ATLAS was built properly and works nicely in Quantum Espresso.

configure options:

enable_debug="no"
enable_optim="aggressive"
prefix="$HOME/MineOS/abinit8-cpu"
FC_LDFLAGS_EXTRA="-Wl,-z,muldefs"
enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_prefix="$HOME/MineOS/openmpi1.6.5-4.7.4"
with_trio_flavor="netcdf+etsf_io"
with_linalg_flavor="atlas"
with_linalg_incs="-I$HOME/MineOS/atlas3-4.7.4/include"
with_linalg_libs="-L$HOME/MineOS/atlas3-4.7.4/lib -lsatlas -lstdc++"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"
enable_exports="yes"
enable_gw_dpc="yes"
enable_memory_profiling="no"


The code optimization is -O3 because I previously used the default code optimization from abinit and got the same result. By the way, QE was also compiled with -O3, so I think there is no problem with code optimization.

A note: SPAMHAUS frequently blocks my IP even though I am on Linux and my browser cache is cleared (Spamhaus even complains that I have Windows spamware, which looks like a bug on their side), so my replies may be a bit slow. Also, if something looks wrong in the input, I may have deleted or added some characters by mistake while editing. The input works fine; it is just the iterations for energy convergence that are slow.

gmatteo
Posts: 291
Joined: Sun Aug 16, 2009 5:40 pm

Re: Test the performance of abinit8 SCF too slow

Post by gmatteo » Sun Feb 26, 2017 9:19 pm

A few comments on your test.

First of all, your calculation is done with 4 k-points in the IBZ and the default eigenvalue solver (conjugate-gradient).
The CG eigensolver is very robust, but its parallelism is limited to nkpt * nsppol MPI processes.
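
As a rough illustration of this limit, the sketch below counts the k-points of the unshifted 6 x 1 x 1 grid from the input file and the number of MPI processes that can do useful work with the CG solver. It assumes that only the k / -k equivalence reduces the grid, that nsppol = 1 (the input does not set it), and that the run used 16 MPI processes; the function names are hypothetical helpers, not part of Abinit.

```python
from fractions import Fraction


def irreducible_kpoints_1d(n):
    """Count the k-points of an unshifted n x 1 x 1 grid after folding
    k and -k together (assumption: no other symmetry reduces the grid)."""
    seen = set()
    for i in range(n):
        k = Fraction(i, n)
        minus_k = (-k) % 1          # bring -k back into [0, 1)
        seen.add(min(k, minus_k))   # one representative per {k, -k} pair
    return len(seen)


def cg_parallel_usage(nproc, nkpt, nsppol=1):
    """Return (busy, idle) MPI process counts when the parallelism is
    limited to nkpt * nsppol processes."""
    busy = min(nproc, nkpt * nsppol)
    return busy, nproc - busy


nkpt = irreducible_kpoints_1d(6)
busy, idle = cg_parallel_usage(nproc=16, nkpt=nkpt)
print(f"nkpt = {nkpt}, busy = {busy}, idle = {idle}")
```

With these assumptions the sketch reproduces the 4 k-points quoted above and shows that only 4 of the 16 processes receive work.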

Note that the code does not stop if you run it with mpi_procs > nkpt * nsppol; it just issues a warning in the log file.
In your benchmark there are 12 processes that are completely idle, and this explains why the wall-time per iteration is so large.
I suggest using paral_kgb = 1 (the LOBPCG eigenvalue solver) for this case, because with paral_kgb = 1 one can
take advantage of additional levels of parallelism (MPI-FFT with npfft, band parallelism with npband, spinor and atom parallelism).
If I use:

paral_kgb 1
npkpt 4
npfft 4
npband 1

I get much more reasonable results for time/iteration with 100 bands (< 30 seconds).
Obviously you can use more CPUs and activate the band parallelism when the scalability of MPI-FFT begins to saturate.
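
To explore other process grids, a small sketch like the following may help. It assumes that the product npkpt * npfft * npband has to match the number of MPI processes (with no spinor parallelism) and that npkpt should not exceed nkpt * nsppol. Abinit may impose further constraints, for example between nband and npband, that are not checked here. The helper kgb_grids is hypothetical and is not an Abinit tool.

```python
def kgb_grids(nproc, nkpt, nsppol=1):
    """List (npkpt, npfft, npband) triples whose product equals nproc,
    keeping npkpt no larger than nkpt * nsppol."""
    grids = []
    for npkpt in range(1, min(nproc, nkpt * nsppol) + 1):
        if nproc % npkpt:
            continue                      # npkpt must divide nproc
        rest = nproc // npkpt
        for npfft in range(1, rest + 1):
            if rest % npfft == 0:         # npfft must divide what is left
                grids.append((npkpt, npfft, rest // npfft))
    return grids


# 16 MPI processes and 4 k-points, as in this thread.
for npkpt, npfft, npband in kgb_grids(nproc=16, nkpt=4):
    print(f"npkpt {npkpt}  npfft {npfft}  npband {npband}")
```

The grid used above (npkpt 4, npfft 4, npband 1) is one of the listed candidates; which candidate is fastest depends on the system and has to be measured.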

I've noticed that you are using the internal FFT routines and the FFT mesh in your benchmark is not that small.
I suggest an optimized external library for FFT.
In abinit/doc/build/config-examples there are several examples for fftw3 and mkl.
Other examples for HPC clusters are available at https://github.com/abinit/abiconfig

If I use perf to profile the code with 300 bands, I find that the most CPU-demanding sections are:

12.83% abinit abinit [.] opernlb_ylm_
8.81% abinit [unknown] [k] 0xffffffff811b691a
6.77% abinit abinit [.] opernla_ylm_
6.64% abinit libmkl_avx2.so [.] mkl_blas_avx2_xzgemv
6.29% abinit libmkl_avx2.so [.] mkl_blas_avx2_zgemm_zccopy_right6_ea

The other sections score less than 5%.
In a previous post, Jordan wrote that he has been working on an improved version of the opernlb kernels that will hopefully
reduce the time spent on the application of the non-local part.

Best regards,
M

Dominic
Posts: 18
Joined: Mon Jan 21, 2013 4:34 pm

Re: Test the performance of abinit8 SCF too slow

Post by Dominic » Mon Feb 27, 2017 5:01 pm

Hi, all good now: from 10 min - 1 hr down to < 1.5 min, that is a huge improvement! Maybe I need to tweak my build further. However, why is this feature not enabled automatically by Abinit? Could Abinit default to the fastest possible settings unless the user changes them?
