SCF not converged (but really is) parallel BUG  [SOLVED]


Post by ldamewood » Mon Apr 28, 2014 7:41 pm

I ran into a problem similar to the one this poster reported a few years ago (viewtopic.php?f=8&t=2023s), but for a response function calculation, when running MPI jobs with abinit-7.6.3.

I have nstep set to 200. The code reaches istep = 50 and then exits, claiming:

Code: Select all

200 was not enough SCF cycles to converge;
potential residual=  7.382E-11 exceeds tolvrs=  1.000E-12

This looks like the same bug as in the other post. To dig a bit deeper, I printed res2 and tolvrs in 67_common/scprqt.F90, both during the loop (choice==2) and after the loop completed (choice==3).

Code: Select all

if( ttolvrs==1 .and. res2<tolvrs .and. (.not.noquit)) then
  if (optres==0) then
    write(message, '(a,a,i5,a,1p,e10.2,a,e10.2,a)' ) ch10,&
&    ' At SCF step',istep,'       vres2   =',res2,' < tolvrs=',tolvrs,' =>converged.'
  else
    write(message, '(a,a,i5,a,1p,e10.2,a,e10.2,a)' ) ch10,&
&    ' At SCF step',istep,'       nres2   =',res2,' < tolvrs=',tolvrs,' =>converged.'
  end if
  call wrtout(ab_out,message,'COLL')
  call wrtout(std_out,message,'COLL')
  write(*,*)'choice2: res2/tolvrs = ',res2,tolvrs ! ADDED
  quit=1
end if

I added a similar write statement at the start of the choice==3 branch.

During the loop, the exit criterion was met on 7 of the 8 nodes:

Code: Select all

choice2: res2/tolvrs = 9.666527591608606E-013 1.000000000000000E-012 (7 times)

After the loop, however, the same printout showed that one node still held the res2 value from the previous iteration:

Code: Select all

choice3: res2/tolvrs = 9.666527591608606E-013 1.000000000000000E-012 (7 times)
choice3: res2/tolvrs = 7.382415892658378E-011 1.000000000000000E-012


Additionally, I did not receive the "=>converged" output, so I assume the node that isn't getting updated is the master node. Any suggestions on how to fix this error are appreciated. Thanks.
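
The per-node comparison above can be tallied directly from the log instead of being counted by eye. The following is only a minimal sketch: it assumes the added write statements produce lines of the form "choice3: res2/tolvrs = <res2> <tolvrs>" and that stdout was redirected to a file such as run.log. The function name tally_res2 is hypothetical and not part of ABINIT.

```shell
#!/bin/sh
# Count how many times each distinct res2 value was printed by the
# added debug lines. tally_res2 is a hypothetical helper name.
# Usage: tally_res2 <tag> <logfile>    e.g. tally_res2 choice3 run.log
# Assumes lines of the form: "<tag>: res2/tolvrs = <res2> <tolvrs>"
tally_res2() {
  tag=$1
  log=$2
  # Split on whitespace: $1="<tag>:" $2="res2/tolvrs" $3="=" $4=res2 $5=tolvrs
  grep "^ *${tag}:" "$log" \
    | awk '{print $4}' \
    | sort \
    | uniq -c \
    | awk '{print $1, $2}'   # normalise to "<count> <res2>"
}
```

Calling `tally_res2 choice3 run.log` prints one line per distinct res2 value, prefixed by the number of times it appeared. More than one output line means the nodes disagree on res2; for the eight choice3 lines reported above, that would be one value seen 7 times and another seen once.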

Here is my setup:
abinit-7.6.3
intel 12.1 + mkl + mkl-fftw3
mvapich2 1.7 (shared mem)

Code: Select all

../configure \
  --prefix=$home/software/abinit/7.6/intel \
  --enable-64bit-flags \
  --enable-gw-dpc \
  --with-linalg-flavor=mkl \
  --with-fft-flavor=fftw3-mkl \
  --with-mpi-level=2 \
  --enable-mpi-io \
FC="mpiifort -mkl=cluster -debug " \
CC="mpiicc -mkl=cluster -debug " \
CXX="mpiicpc -mkl=cluster -debug "


Then I run using:

Code: Select all

srun -n 8 -p pdebug ~/Downloads/abinit-7.6.3/debug/src/98_main/abinit < run.files >& run.log

Attachments:
input.log (input parameters)
run.log (stdout and stderr)
cfg.log (configure script output)


Post by ldamewood » Tue Apr 29, 2014 1:59 am

Changing -fp-model to strict fixed it. Unfortunately, I had run into this problem in the past and forgot that there was an easy fix. I had assumed this issue had something to do with MPI.

Attachments:
out.log (errors)


Post by gmatteo » Wed Apr 30, 2014 2:20 am

ldamewood wrote:
Changing the -fp-model to strict fixed it.

Where did you change the compilation option: in the MPI library, in Abinit, or in both?


Post by ldamewood » Fri May 09, 2014 10:35 pm

Hi

I only changed the flag in ABINIT. The MPI package I use is maintained by someone else. My new configure script looks like this:

Code: Select all

../configure \
  --prefix=$home/software/abinit/7.6/intel \
  --enable-64bit-flags \
  --enable-gw-dpc \
  --with-linalg-flavor=mkl \
  --with-fft-flavor=fftw3-mkl \
  --with-trio-flavor=netcdf+etsf_io \
  --with-mpi-level=2 \
  --enable-mpi-io \
  --enable-openmp \
FC="mpiifort -mkl=cluster -fp-model strict -openmp" \
CC="mpiicc -mkl=cluster -fp-model strict -openmp" \
CXX="mpiicpc -mkl=cluster -fp-model strict -openmp"


Unfortunately, one file refuses to compile using "-fp-model strict":

Code: Select all

mpiifort -mkl=cluster -fp-model strict -openmp -DHAVE_CONFIG_H -I. -I../../../src/32_util -I../.. -I../../src/incs -I../../../src/incs -I/g/g14/damewood/Downloads/abinit-7.6.3/openmp/fallbacks/exports/include   -free -module /g/g14/damewood/Downloads/abinit-7.6.3/openmp/src/mods  -O2 -xHost -g -extend-source -vec-report0 -noaltparam -nofpscomp  -openmp -c -o m_exp_mat.o ../../../src/32_util/m_exp_mat.F90
0_12459

: catastrophic error: **Internal compiler error: internal abort** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
compilation aborted for ../../../src/32_util/m_exp_mat.F90 (code 1)
make: *** [m_exp_mat.o] Error 1


However, I was able to manually compile with "-fp-model source":

Code: Select all

mpiifort -mkl=cluster -fp-model source -openmp -DHAVE_CONFIG_H -I. -I../../../src/32_util -I../.. -I../../src/incs -I../../../src/incs -I/g/g14/damewood/Downloads/abinit-7.6.3/openmp/fallbacks/exports/include   -free -module /g/g14/damewood/Downloads/abinit-7.6.3/openmp/src/mods  -O2 -xHost -g -extend-source -vec-report0 -noaltparam -nofpscomp  -openmp -c -o m_exp_mat.o ../../../src/32_util/m_exp_mat.F90


Update: The compilation error in m_exp_mat is probably due to a bug in the *shift functions in ifort 12.1. The file compiles with 12.0 and 13.0.
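
The manual workaround above (try "-fp-model strict", fall back to "-fp-model source" for a file that fails) can be written as a small helper. This is only a sketch under assumptions, not part of ABINIT's build system: FC_CMD and compile_with_fp_fallback are hypothetical names, and the remaining compiler arguments (include paths, -c, -o, the source file) must be supplied by the caller.

```shell
#!/bin/sh
# Compile one file with -fp-model strict; if the compiler fails, retry the
# same command with -fp-model source. Prints which model succeeded.
# FC_CMD is a hypothetical variable naming the compiler wrapper
# (mpiifort in this thread); all other arguments are passed through as-is.
compile_with_fp_fallback() {
  fc=${FC_CMD:-mpiifort}
  if "$fc" -fp-model strict "$@"; then
    echo "strict"
  elif "$fc" -fp-model source "$@"; then
    echo "source"
  else
    echo "failed"
    return 1
  fi
}
```

For example, `compile_with_fp_fallback -c -o m_exp_mat.o ../../../src/32_util/m_exp_mat.F90` (plus the other flags from the compile line above) would report which of the two settings the compiler accepted for that file.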
