parallel berryopt -1 crashing with more than one node


Postby antonio » Wed Dec 05, 2018 6:02 pm

Dear all,
I compiled abinit 8.10.1 on the Salomon and Anselm clusters using the Intel 17.0 compilers, libxc 3.0.0 and the following configure options (config and make logs attached):

./configure --prefix=/home/acamm/bin/abinit-8.10.1 \
 --enable-mpi --enable-mpi-io --enable-optim \
 --with-dft-flavor=libxc \
 --with-mpi-level=2 \
 --enable-mpi-inplace \
 --with-trio-flavor=netcdf-fallback \
 --enable-fallbacks \
 --enable-avx-safe-mode \
 --with-fc-vendor=intel \
 --with-fft-flavor=fftw3-mkl \
 --with-fft-libs="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" \
 --with-linalg-flavor="mkl+scalapack" \
 --with-linalg-libs="-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl" \
 FCFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 FFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 CFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 CXXFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 FC=mpiifort \
 CC=mpiicc

I am trying to run a Berry phase calculation in preparation for a geometry relaxation in the presence of an electric field. I start the calculation from previously converged wavefunction and density files; the input is attached. The calculation terminates correctly if I run it on only one node. If I run it on two or more nodes, abinit crashes without any error message. I tried recompiling while adding or removing the following options, one at a time and in combination:
1) the --enable-mpi-* options
2) --enable-optim
3) --with-mpi-level=1 or 2
4) --enable-zdot-bugfix
5) --enable-avx-safe-mode

In all cases, the job ends cleanly only when I use a single node.
Any suggestion would be greatly appreciated.

Thanks a lot in advance

Antonio Cammarata

Re: parallel berryopt -1 crashing with more than one node

Postby ebousquet » Thu Dec 06, 2018 10:23 am

Dear Antonio,
I don't see anything really wrong in your compilation. It looks like a problem specific to the machine architecture; you could contact the IT staff of the clusters and ask them to get a more detailed error message from the machine.
A few questions:
How many k-points do you have in your calculation?
If you run other types of calculations, such as a relaxation or a single-point energy, do you get the same problem (to find out whether this is linked to the E-field or not)?
Best wishes,
Location: University of Liege, Belgium

Re: parallel berryopt -1 crashing with more than one node

Postby antonio » Thu Dec 06, 2018 11:19 am

Dear Eric,

thanks for your quick answer. The number of k-points is the one given in the attached file:
ngkpt 9 9 9

When I run this job on a single node, it terminates cleanly. I then used the converged WFK and DEN files to restart a calculation on 2 nodes in which I optimize the same structure with a non-zero electric field. I tried efield, red_efield and red_efieldbar but, in each case, the job crashes as soon as it enters the computation of the Berry phase; if I use only one node, it continues without any problem (I could not complete the relaxation because it takes too long on one node). I therefore believe that the problem is related to the Berry phase routine and the parallelization scheme over multiple nodes.
As an update, I ran the previously attached calculation again and managed to have the error flushed into a file. When it crashes, the __ABI_MPIABORTFILE__ file contains the following error:

--- !BUG
src_file: m_berryphase_new.F90
src_line: 1009
mpi_rank: 14
message: |
    For k-point # 173,
    the determinant of the overlap matrix is found to be 0.
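
For anyone hitting the same silent crash, a small helper like the one below can check the run directory for the abort file after the job ends. This is only a sketch: the file name __ABI_MPIABORTFILE__ is the one mentioned above, while the function name read_abort_file and the idea of calling it from a job script are my own assumptions, not part of abinit.

```python
from pathlib import Path


def read_abort_file(run_dir="."):
    """Return the text of the abinit MPI abort file in run_dir, or None if absent.

    read_abort_file is a hypothetical helper, not an abinit tool.
    """
    abort_file = Path(run_dir) / "__ABI_MPIABORTFILE__"
    if abort_file.is_file():
        # errors="replace" keeps the helper usable even if the file is not clean text
        return abort_file.read_text(errors="replace")
    return None


if __name__ == "__main__":
    message = read_abort_file()
    print(message if message is not None else "no abort file found")
```

Running it from the directory where the job was launched prints the !BUG block if the file exists.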

It therefore seems to me that there is a lack of communication between the nodes, such that either some part of the overlap matrix is not received by the master node, or the results of the overlap integrals are not correctly collected and are left at zero; as a consequence, the determinant is null. This is just a guess, as I am not an expert in programming.
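
To illustrate the guess above: if one row of an overlap matrix is never filled in (for example because the data from another process did not arrive), its determinant is exactly zero, and the Berry phase, which is built from the phase of such determinants, is then undefined. The toy example below is not taken from m_berryphase_new.F90; the 3x3 matrix values and the det helper are invented purely for illustration.

```python
# Illustrative only: this is NOT the code in m_berryphase_new.F90.

def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        # Minor: drop the first row and the current column
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total


# Hypothetical overlap matrix between occupied bands at two neighbouring k-points.
overlap = [
    [0.98 + 0.05j, 0.02 - 0.01j, 0.00 + 0.01j],
    [0.01 + 0.02j, 0.97 - 0.04j, 0.03 + 0.00j],
    [0.00 - 0.01j, 0.02 + 0.01j, 0.99 + 0.02j],
]

# Same matrix, but the last row was never received and stays at zero.
broken = [row[:] for row in overlap]
broken[2] = [0j, 0j, 0j]

print(abs(det(overlap)))  # clearly non-zero
print(abs(det(broken)))   # zero: the phase of the determinant is undefined
```

A single missing row is enough to give a zero determinant, which would be consistent with the message in the abort file, although it does not prove that this is what happens inside abinit.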

Thanks again for your help.

Re: parallel berryopt -1 crashing with more than one node

Postby jzwanzig » Fri Dec 14, 2018 8:05 pm

I need more information to give a helpful answer. In particular, how many kpts are you using? How many nodes? PAW or NCPP?

Josef W. Zwanziger
Professor, Department of Chemistry
Canada Research Chair in NMR Studies of Materials
Dalhousie University
Halifax, NS B3H 4J3 Canada

Re: parallel berryopt -1 crashing with more than one node

Postby antonio » Thu Jan 10, 2019 1:50 am

Dear Joe,
thanks for your answer. I have an update on this. I recompiled abinit with the GNU compilers on the Salomon machine I mentioned in my first post, and now the calculation with the input file that I attached before ends correctly. So it looks like the issue is related to the choice of compilers. I attach here a file containing the configure settings and the compilers and libraries loaded at compilation time, for future reference.

Unfortunately, this version doesn't work for a phonon calculation in the presence of an electric field. The input file is attached; I ran it on the Salomon cluster using 2 nodes for a total of 48 cores (24 cores per node). Once the calculation enters DATASET 3, abinit stops, producing the errors reported in the job.err file. I attach the abinit output (ab.out) and the standard output (std.out).

I tried recompiling abinit without the optimization option, without the mpi-io and mpi-inplace options, and with the non-MPI fftw3, but it stops at the same point with the same errors. I also enabled the debug option but did not obtain any further information. Only the serial version works, but it is extremely slow.

Thanks a lot for your help


