parallel berryopt -1 crashing with more than one node


Postby antonio » Wed Dec 05, 2018 6:02 pm

Dear all,
I compiled abinit 8.10.1 on the Salomon and Anselm clusters using the Intel 17.0 compilers, libxc 3.0.0, and the following configure options (config and make logs attached):

Code: Select all
./configure --prefix=/home/acamm/bin/abinit-8.10.1 \
 --enable-mpi --enable-mpi-io --enable-optim \
 --with-dft-flavor=libxc \
 --with-mpi-level=2 \
 --enable-mpi-inplace \
 --with-trio-flavor=netcdf-fallback \
 --enable-fallbacks \
 --enable-avx-safe-mode \
 --with-fc-vendor=intel \
 --with-fft-flavor=fftw3-mkl \
 --with-fft-libs="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" \
 --with-linalg-flavor="mkl+scalapack" \
 --with-linalg-libs="-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl" \
 FCFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 FFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 CFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 CXXFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 FC=mpiifort \
 CC=mpiicc \
 CXX=mpiicpc


I am trying to run a Berry-phase calculation in preparation for a geometry relaxation in the presence of an electric field. I start the calculation from previously converged wavefunction and density files; the input is attached. The calculation terminates correctly if I run it on a single node, but on two or more nodes abinit crashes without any error message. I tried recompiling, adding/removing the following options one by one and in combination:
1) the --enable-mpi-* options
2) --enable-optim
3) --with-mpi-level=1 or 2
4) --enable-zdot-bugfix
5) --enable-avx-safe-mode

In all cases, the job ends cleanly if I use only one node; the sketch below shows roughly how I launch it.
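
In case it helps, this is roughly how I submit the job (a sketch from memory: the queue directives, module name and core counts are placeholders, not my exact script):

Code: Select all
#!/bin/bash
# two-node run; the single-node run that completes uses select=1
#PBS -l select=2:ncpus=24:mpiprocs=24
cd "$PBS_O_WORKDIR"
# load the same Intel 17.0 + MKL environment used for the build (placeholder module name)
module load intel/2017
mpirun /home/acamm/bin/abinit-8.10.1/bin/abinit < berry.files > berry.log 2> berry.err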
Any suggestion is really appreciated.

Thanks a lot in advance

Antonio Cammarata
Attachments
config.log (256.02 KiB)
CONF_stdout.log (38.6 KiB)
ERR_make.log (393.88 KiB)
berry.in - abinit input (1.6 KiB)

Re: parallel berryopt -1 crashing with more than one node

Postby ebousquet » Thu Dec 06, 2018 10:23 am

Dear Antonio,
I don't see anything really wrong in your compilation. It looks like a machine-architecture-specific problem; you could contact the IT staff of the clusters and ask them to extract a more detailed error message from the machine.
A few questions:
How many k-points do you have in your calculation?
If you run other types of calculations, such as a relaxation or a single-point energy, do you have the same problem (to see whether this is linked to the E-field or not)?
Best wishes,
Eric

Re: parallel berryopt -1 crashing with more than one node

Postby antonio » Thu Dec 06, 2018 11:19 am

Dear Eric,

thanks for your quick answer. The number of k-points is the one set in the attached berry.in file:
ngkpt 9 9 9

When I run this job on a single node, it terminates cleanly. I then used the converged WFK and DEN files to restart a calculation on two nodes in which I optimize the same structure with a non-null efield. I tried efield, red_efield and red_efieldbar, but in each case the job crashes as soon as it enters the computation of the Berry phase; if I use only one node, it continues without any problem (I could not complete the relaxation because it takes too long on one node). I therefore believe that the problem is related to the Berry-phase routine and the parallelization scheme over multiple nodes.
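
For reference, the Berry-phase part of the input looks roughly like this (a minimal sketch; only ngkpt is taken from the actual berry.in, the other values are illustrative):

Code: Select all
ngkpt    9 9 9      # the k-point grid used in berry.in
berryopt -1         # Berry-phase computation of the polarization
rfdir    1 1 1      # polarization along all three reciprocal directions
irdwfk   1          # read the previously converged wavefunctions
irdden   1          # read the previously converged density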
As an update, I ran the previously attached berry.in calculation again and managed to have the error flushed to a file. When the job crashes, the __ABI_MPIABORTFILE__ file contains the following error:


Code: Select all
--- !BUG
src_file: m_berryphase_new.F90
src_line: 1009
mpi_rank: 14
message: |
    For k-point # 173,
    the determinant of the overlap matrix is found to be 0.
...

It therefore seems to me that there is a communication problem between the nodes, such that some part of the overlap matrix is never received by the master node, or the results of the overlap integrals are not correctly collected and remain zero; as a consequence, the determinant is null. This is just an idea; I am not an expert programmer.
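
For context, my textbook-level understanding of the quantity the routine computes (the standard King-Smith/Vanderbilt Berry-phase formulation, in my own notation, not taken from the abinit source) is:

Code: Select all
S_mn(k, k+b) = < u_mk | u_n,k+b >     # overlap matrix between occupied states
                                      # at neighbouring k-points on the string
phi ~ Im ln prod_k det S(k, k+b)      # Berry phase accumulated along the string

If a block of S(k, k+b) were never filled in (e.g. never communicated between processes), its determinant would be exactly zero, which would match the BUG message above.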

Thanks again for your help.
Best
Antonio
antonio
 
Posts: 40
Joined: Tue Apr 23, 2013 6:16 pm

Re: parallel berryopt -1 crashing with more than one node

Postby jzwanzig » Fri Dec 14, 2018 8:05 pm

Hi,
I need more information to give a helpful answer. In particular, how many k-points are you using? How many nodes? PAW or NCPP?

thanks,
Joe
Josef W. Zwanziger
Professor, Department of Chemistry
Canada Research Chair in NMR Studies of Materials
Dalhousie University
Halifax, NS B3H 4J3 Canada
jzwanzig@gmail.com

