parallel berryopt -1 crashing with more than one node


Post by antonio » Wed Dec 05, 2018 6:02 pm

Dear all,
I compiled abinit 8.10.1 on the Salomon and Anselm clusters using the Intel 17.0 compilers, libxc 3.0.0 and the following configure parameters (config and make logs attached):

./configure --prefix=/home/acamm/bin/abinit-8.10.1 \
 --enable-mpi --enable-mpi-io --enable-optim \
 --with-dft-flavor=libxc \
 --with-mpi-level=2 \
 --enable-mpi-inplace \
 --with-trio-flavor=netcdf-fallback \
 --enable-fallbacks \
 --enable-avx-safe-mode \
 --with-fc-vendor=intel \
 --with-fft-flavor=fftw3-mkl \
 --with-fft-libs="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" \
 --with-linalg-flavor="mkl+scalapack" \
 --with-linalg-libs="-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl" \
 FCFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 FFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 CFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 CXXFLAGS="-O2 -axCORE-AVX2 -xavx -mkl -fp-model precise -heap-arrays " \
 FC=mpiifort \
 CC=mpiicc \
 CXX=mpiicpc


I am trying to run a Berry phase calculation in preparation for a geometry relaxation in the presence of an electric field. I start the calculation from previously converged wavefunction and density files; the input is attached. The calculation terminates correctly if I run it on only one node. If I run it on two or more nodes, abinit crashes without any error message. I tried recompiling with the following options added or removed, one at a time and in combination:
1) the --enable-mpi-* flags
2) --enable-optim
3) --with-mpi-level=1 or 2
4) --enable-zdot-bugfix
5) --enable-avx-safe-mode

In all cases, the job ends cleanly if I use only one node.
Any suggestion is really appreciated.

Thanks a lot in advance

Antonio Cammarata
Attachments
config.log (256.02 KiB)
CONF_stdout.log (38.6 KiB)
ERR_make.log (393.88 KiB)
berry.in (abinit input, 1.6 KiB)

Post by ebousquet » Thu Dec 06, 2018 10:23 am

Dear Antonio,
I don't see anything really wrong in your compilation. It looks like a machine-architecture-specific problem; you could ask the IT staff of the clusters to get a more detailed error message from the machine.
A few questions:
How many k-points do you have in your calculation?
If you run other types of calculations, such as a relaxation or a single-point energy, do you get the same problem (to check whether this is linked to the E-field or not)?
Best wishes,
Eric

Post by antonio » Thu Dec 06, 2018 11:19 am

Dear Eric,

thanks for your quick answer. The number of kpoints is that in the attached berry.in file
ngkpt 9 9 9

When I run this job on a single node, it terminates cleanly. I then used the converged WFK and DEN files to restart a calculation on 2 nodes in which I optimize the same structure with a non-zero efield. I tried efield, red_efield and red_efieldbar but, in each case, the job crashes as soon as it enters the computation of the Berry phase; if I use only one node, it continues without any problem (I could not finish the relaxation because it takes too long on one node). I therefore believe that the problem is related to the Berry phase routine and the parallelization scheme over multiple nodes.
As an update, I ran the previously attached berry.in calculation again and managed to have the error flushed into a file. When it crashes, the __ABI_MPIABORTFILE__ file contains the following error:


--- !BUG
src_file: m_berryphase_new.F90
src_line: 1009
mpi_rank: 14
message: |
    For k-point # 173,
    the determinant of the overlap matrix is found to be 0.
...

It therefore seems to me that there is a lack of communication between the nodes, such that some part of the overlap matrix is not received by the master node, or the results of the overlap integrals are not correctly collected and end up zeroed; as a consequence, the determinant is null. This is just an idea; I am not an expert in programming.
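
To make the hypothesis above concrete, here is a minimal toy sketch in Python (not ABINIT code; the matrix values and names are made up for illustration). It only shows that if one row of an overlap matrix is never filled and stays at zero, the determinant is exactly zero and would fall below a 1e-12 threshold. It says nothing about whether this is what actually happens inside m_berryphase_new.F90.

```python
# Toy illustration only: not ABINIT code. All names and values are hypothetical.

def det(matrix):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0
    for col in range(n):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * matrix[0][col] * det(minor)
    return total

# A made-up 3x3 complex "overlap matrix" between occupied bands.
full = [
    [0.90 + 0.10j, 0.05 + 0.00j, 0.00 + 0.02j],
    [0.04 - 0.01j, 0.80 + 0.20j, 0.03 + 0.00j],
    [0.00 + 0.01j, 0.02 + 0.00j, 0.85 - 0.10j],
]

# Same matrix, but the last row was never received and keeps its initial zeros.
missing_row = [row[:] for row in full]
missing_row[2] = [0j, 0j, 0j]

tol12 = 1e-12
det_full = det(full)
det_missing = det(missing_row)
print("complete matrix flagged as singular:  ", abs(det_full) < tol12)
print("matrix with unfilled row flagged:     ", abs(det_missing) < tol12)
```

Only the second matrix trips the threshold, which is why an unfilled block is a plausible (but unconfirmed) explanation for the message.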

Thanks again for your help.
Best
Antonio

Post by jzwanzig » Fri Dec 14, 2018 8:05 pm

Hi,
I need more information to give a helpful answer. In particular, how many kpts are you using? How many nodes? PAW or NCPP?

thanks,
Joe
Josef W. Zwanziger
Professor, Department of Chemistry
Canada Research Chair in NMR Studies of Materials
Dalhousie University
Halifax, NS B3H 4J3 Canada
jzwanzig@gmail.com

Post by antonio » Thu Jan 10, 2019 1:50 am

Dear Joe,
thanks for your answer. I have an update on this. I recompiled abinit with the GNU compilers on the Salomon machine I mentioned in my first post, and now the calculation for the ab.in file that I attached before ends correctly. So it looks like the issue is related to the choice of compiler. I attach here a file containing the configure settings and the compilers and libraries loaded at compilation time, for future reference.

Unfortunately, this version doesn't work for a phonon calculation in the presence of an electric field. The input file is ab_pho.in; I run it on the Salomon cluster using 2 nodes for a total of 48 cores (24 cores per node). Once the calculation enters DATASET 3, abinit stops, producing the errors reported in the job.err file. I attach the abinit output (ab.out) and the standard output (std.out).

I tried to recompile abinit without the optimization option, without the mpi-io and mpi-inplace options, and with the non-MPI fftw3, but it stops at the same point with the same errors. I also enabled the debug option but didn't obtain any further information. Only the serial version works, but it is extremely slow.

Thanks a lot for your help

Best

Antonio
Attachments
config_gnu.in (700 Bytes)
ab_pho.in (2.32 KiB)
ab.out (52.26 KiB)
std.out (573.58 KiB)
job.err (65.24 KiB)

Post by ebousquet » Sun Jan 27, 2019 11:45 am

Hi Antonio,
If you use PAW, phonons under electric field are not implemented... A clear stop message should be added to the code.
All the best,
Eric

Post by antonio » Mon Jan 28, 2019 10:47 am

Hi Eric,
no, I don't use PAWs, I use the norm-conserving pseudopotentials from the abinit webpage.

Antonio

Post by ebousquet » Mon Jan 28, 2019 11:30 am

Hi Antonio,
OK, so you are not using PAW. In the three datasets you have 50, 260 and 512 k-points while you are running on 48 CPUs, which does not give efficient parallelism. Depending on the compilation, etc., this could also make some calculations crash; I agree it should not, but to be sure I would adapt the number of CPUs accordingly (you also have a "Segmentation fault - invalid memory reference." error from the machine).
I also suspect that you don't have enough RAM for dataset 3 (a typical case where the error ends up as a segmentation fault from the machine instead of a plain "not enough memory" message...).
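
As a rough way to see why these k-point counts balance poorly on 48 processes, here is a small Python sketch. It assumes, for illustration only, that the work is split purely over k-points and that every k-point costs the same; the function name kpoint_load is hypothetical and nothing here comes from ABINIT itself.

```python
import math

def kpoint_load(nkpt, nproc):
    """Return (max k-points on one process, parallel efficiency) for a pure k-point split."""
    per_proc_max = math.ceil(nkpt / nproc)      # the busiest process sets the wall time
    efficiency = nkpt / (per_proc_max * nproc)  # 1.0 would mean a perfectly even split
    return per_proc_max, efficiency

nproc = 48
for nkpt in (50, 260, 512):
    busiest, eff = kpoint_load(nkpt, nproc)
    print(f"nkpt={nkpt:4d}: up to {busiest} k-points per process, efficiency {eff:.2f}")
```

Under these assumptions, the 50 k-point dataset is the worst case: two processes get 2 k-points while the other 46 get 1, so the efficiency is only about 0.52. Choosing a process count that divides the number of k-points avoids this idle time.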
All the best,
Eric

Post by antonio » Wed Jan 30, 2019 5:47 pm

Dear Eric,
each computing node on the Salomon cluster has 128 GB of RAM. I tested the job on up to 10 nodes for a total of 240 cores (each node has 24 cores); I also tested the job on more nodes with fewer than 24 cores per node, so as to have more memory per process, but the job stops at the same point with the same error. When I was using the Intel compilers, I got an error like "the determinant of the overlap matrix is found to be 0.". This does not happen if I run the job in serial, whichever compiler I use. So I believe that the problem is related to how the master node collects the information from the other nodes to build some specific matrix; possibly the matrix is not properly filled and the determinant is null. I think this can also explain the "invalid memory reference": maybe some addresses are associated with parts of such a matrix that are never filled, so the resulting matrix is missing elements. It's only my guess; I don't know the details of the code.

If I can do any test to help in finding the problem, do not hesitate to ask.
Thanks a lot in advance for your help.
All the best
Antonio

Post by ebousquet » Fri Feb 01, 2019 6:27 pm

Dear Antonio,
OK, sounds like a fix can be done but I need you to test it.
In the file src/67_common/m_berryphase_new.F90, you have the following lines:

 998              if (sqrt(dtm_k(1)*dtm_k(1) + dtm_k(2)*dtm_k(2)) < tol12) then
 999                write(message,'(a,i5,a,a,a)')&
1000 &               '  For k-point #',ikpt1,',',ch10,&
1001 &               '  the determinant of the overlap matrix is found to be 0.'
1002                MSG_BUG(message)
1003              end if


Could you replace line 1002, "MSG_BUG(message)", with the following lines:
write(std_out,*)message,dtm_k(1:2)
if(abs(dtm_k(1))<=1d-12)dtm_k(1)=1d-12
if(abs(dtm_k(2))<=1d-12)dtm_k(2)=1d-12
write(std_out,*)' Changing to:',dtm_k(1:2)


As you can see, if dtm_k(1:2) is found to be "zero" (to within machine precision...) then the code stops. Here the fix just replaces the value by 10^-12; I'll see later what the best value is to put there for a definitive fix in the code. This will remove the stop on "the determinant of the overlap matrix is found to be 0" and hopefully this will solve the problem...
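
For readers who do not read Fortran, here is a minimal Python sketch of what the replacement lines do. It assumes dtm_k(1) and dtm_k(2) hold the real and imaginary parts of the determinant and that tol12 is 1e-12; the function name clamp_determinant is hypothetical, and this is an illustration rather than the ABINIT routine.

```python
import math

TOL12 = 1e-12  # assumed value of the tol12 threshold used in the check

def clamp_determinant(dtm_k):
    """Mimic the workaround: dtm_k = [real part, imaginary part] of the determinant."""
    re, im = dtm_k
    if math.hypot(re, im) < TOL12:   # modulus is "zero": the original code aborts here
        if abs(re) <= TOL12:
            re = TOL12               # replace the tiny component by +1e-12
        if abs(im) <= TOL12:
            im = TOL12
    return [re, im]

print(clamp_determinant([0.0, 0.0]))    # clamped
print(clamp_determinant([0.3, -0.4]))   # modulus 0.5, left untouched
```

Note that the clamp always substitutes a positive 1e-12, so the sign of a tiny negative component is lost. It only prevents the abort; it does not repair a determinant that is zero because the underlying data are wrong.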

After these replacements you'll have to recompile abinit, which should be quick.
Let me know how this goes, fingers crossed!
Eric

Post by antonio » Wed Feb 06, 2019 2:46 pm

Dear Eric,
I recompiled abinit with both intel and gnu compilers after making the changes you suggested but I get exactly the same errors.

Antonio

Post by antonio » Tue Feb 12, 2019 5:28 pm

Dear Eric,
I have an update on this. I get the same error if I run a single-point calculation in the presence of an electric field (berryopt 4, efield 0.0 0.0 0.0005) and specify the parallel distribution by hand; the error is the same no matter which values I use for npkpt, npfft, npband and bandpp, whether or not I follow the suggestions from autoparal. I dug a bit into the code and it looks like the "invalid memory reference" is related to a bad definition of the mpi_enreg%* variables; the code stops in src/52_fft_mpi_noabirule/m_fftcore.F90 at the line
mpi_enreg%my_kgtab(i1,ikpt_this_proc) = kg_ind(i1)
when I do the DFPT calculation, and at the line
np_band=1; if(mpi_enreg%paral_kgb==1) np_band=max(1,mpi_enreg%nproc_band)
when I just do an SCF with berryopt 4 and turn on the manual parallelization with paral_kgb 1 and the related keywords.

I hope this helps to solve the issue.

Thanks
Antonio

Post by ebousquet » Thu Feb 21, 2019 6:39 pm

Dear Antonio,
It seems there might be a problem with memory; I'm looking at it.
I would like to clarify one thing: you said that at some point you managed to get the error message ending with "the determinant of the overlap matrix is found to be 0.". The fix I sent you should bypass this problem; is that the case (this one is not a memory problem)? Or do you get the memory problem instead?
Best wishes,
Eric

Post by ebousquet » Thu Feb 21, 2019 6:55 pm

OK, compiling with enable_debug='naughty' and with_fno_backtrace="yes" gives me a bit more detail on where exactly it crashes:
 kpgio: loop on k-points done in parallel
At line 481 of file m_dfpt_fef.F90
Fortran runtime error: Index '126613' of dimension 1 of array 'pwindall' above upper bound of 126612

Error termination. Backtrace:
At line 481 of file m_dfpt_fef.F90
Fortran runtime error: Index '126613' of dimension 1 of array 'pwindall' above upper bound of 126612

Error termination. Backtrace:
At line 481 of file m_dfpt_fef.F90
Fortran runtime error: Index '126613' of dimension 1 of array 'pwindall' above upper bound of 126612

Error termination. Backtrace:
At line 481 of file m_dfpt_fef.F90
Fortran runtime error: Index '126613' of dimension 1 of array 'pwindall' above upper bound of 126612


...
Last edited by ebousquet on Wed Mar 27, 2019 6:29 pm, edited 2 times in total.

Post by antonio » Sun Feb 24, 2019 7:14 pm

ebousquet wrote:
Dear Antonio,
It seems there might be a problem with memory; I'm looking at it.
I would like to clarify one thing: you said that at some point you managed to get the error message ending with "the determinant of the overlap matrix is found to be 0.". The fix I sent you should bypass this problem; is that the case (this one is not a memory problem)? Or do you get the memory problem instead?
Best wishes,
Eric


I did the test with your fix and it didn't work (see my post above from Feb 6). I still have the memory problem which, judging from your last post, looks related to the bad sizing of some matrix, consistent with what I saw (see my post from Feb 12). If there is some test I can do to help, please let me know.
Thanks for your help.

Antonio

Post by ebousquet » Tue Feb 26, 2019 5:16 pm

Dear Antonio,
OK, there is indeed a problem in the parallel implementation of m_dfpt_fef.F90; I don't even understand how whoever coded it could have had it working in parallel. The problem is a mismatch between mkmem, the number of k-points handled by each MPI process, and nkpt, the total number of k-points (as Matteo pointed out to me). It works in sequential because then mkmem=nkpt and the sum over k-points is OK.
I'll try to fix it, but this can take some time since I don't know this routine...
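
The mkmem versus nkpt mismatch can be illustrated with a toy Python model. The names nkpt, mkmem and npw follow the post, but the flat buffer layout and the function first_out_of_bounds are assumptions made for illustration; whether pwindall is dimensioned exactly this way in m_dfpt_fef.F90 is not confirmed here.

```python
# Toy model only: not ABINIT code. The buffer layout is an assumption.

def first_out_of_bounds(nkpt, mkmem, npw):
    """Return the first 1-based index that overflows a buffer sized for mkmem
    k-points when the loop runs over all nkpt k-points, or None if it fits."""
    size = mkmem * npw                      # buffer holds only the local k-points
    for ikpt in range(nkpt):                # but the loop treats every k-point as local
        for ipw in range(npw):
            index = ikpt * npw + ipw + 1    # Fortran-style 1-based index
            if index > size:
                return index
    return None

nkpt, npw = 8, 100
print(first_out_of_bounds(nkpt, mkmem=nkpt, npw=npw))  # sequential run: mkmem == nkpt
print(first_out_of_bounds(nkpt, mkmem=2, npw=npw))     # 4 processes: mkmem == 2
```

In the sequential case nothing overflows, while in the parallel case the first bad index is exactly one past the end of the buffer. That is the same "one above the upper bound" pattern as in the bounds-check error reported earlier in the thread, although the real array sizes and loop structure may differ from this sketch.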
Best wishes,
Eric

Post by antonio » Sun Mar 03, 2019 2:48 pm

Dear Eric,
I'm glad that the problem has been identified, I look forward to news. If I can be of any help with testing, please, let me know.

Best
Antonio

Post by ebousquet » Fri Mar 22, 2019 5:58 pm

OK, actually the developer reported in the automatic test v5/t23 that phonons under electric field fail when run with MPI...
So the bug has never been fixed!
Eric