
BUG MPI always crashes

Posted: Wed Aug 16, 2017 3:37 pm
by recohen
I am getting a crash after the first SCF iteration for a large system with LDA+U and the DF2 vdW functional, and I have not been able to get past it despite changing many things. The error looks like:

ETOT 1 -397.67895452777 -3.977E+02 2.061E-03 8.769E+05
scprqt: <Vxc>= -2.6931088E-01 hartree

Simple mixing update:
residual square of the potential : 1.193092724287603E-002
Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(1555).................: MPI_Allreduce(sbuf=0x7ffd5fc4ceb4, rbuf=0x7ffd5fc4cca0, count=1, MPI_INTEGER, MPI_SUM, comm=0xc40000f4) failed
MPIR_Allreduce_impl(1396)...........: fail failed
MPIR_Allreduce_intra(890)...........: fail failed


I attached the input file; the other attachments are not allowed on the site.

Ron Cohen

Re: BUG MPI always crashes

Posted: Fri Aug 18, 2017 4:41 pm
by recohen
I finally got a usable error message. It is:
--- !BUG
src_file: m_pawrhoij.F90
src_line: 1179
mpi_rank: 0
message: |
Wrong sizes sum[nrhoij_ij]/=nrhoij_out !
...


leave_new: decision taken to exit ...
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 0

Re: BUG MPI always crashes

Posted: Tue Aug 22, 2017 4:28 pm
by torrent
Dear Ron,

I think I have found the bug, and a fix, from your test case.
Apparently, the parallelism over atoms (automatically activated) was not compatible with prtden<0.
So I can propose 3 ways to run your case (options 1 and 2 are sketched after the code below):
1- deactivate the parallelism over atoms by setting paral_atom 0 in the input file (this will consume more memory).
2- don't use prtden<0 if it is not necessary.
3- change the source code: in the src/*/scfcv.F90 file, replace the following piece of code:

Code: Select all

     if(dtset%prtden<0.or.dtset%prtkden<0) then
!      Update the content of the header (evolving variables)
       bantot=hdr%bantot
       if (dtset%positron==0) then
         call hdr_update(hdr,bantot,etotal,energies%e_fermie,residm,&
&         rprimd,occ,pawrhoij,xred,dtset%amu_orig(:,1),&
&         comm_atom=mpi_enreg%comm_atom,mpi_atmtab=mpi_enreg%my_atmtab)
       else
         call hdr_update(hdr,bantot,electronpositron%e0,energies%e_fermie,residm,&
&         rprimd,occ,pawrhoij,xred,dtset%amu_orig(:,1),&
&         comm_atom=mpi_enreg%comm_atom,mpi_atmtab=mpi_enreg%my_atmtab)
       end if
     end if
by

Code: Select all

     if(dtset%prtden<0.or.dtset%prtkden<0) then
!      Update the content of the header (evolving variables)
!      Note: hdr_update is now called without the optional comm_atom/mpi_atmtab arguments
       bantot=hdr%bantot
       if (dtset%positron==0) then
         call hdr_update(hdr,bantot,etotal,energies%e_fermie,residm,&
&         rprimd,occ,pawrhoij,xred,dtset%amu_orig(:,1))
       else
         call hdr_update(hdr,bantot,electronpositron%e0,energies%e_fermie,residm,&
&         rprimd,occ,pawrhoij,xred,dtset%amu_orig(:,1))
       end if
     end if
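
For options 1 and 2, the corresponding input-file lines would look like this (a minimal sketch; paral_atom and prtden are the standard input variables named above, and the values simply follow the suggestions):

Code: Select all

paral_atom 0   # option 1: deactivate the parallelism over atoms (uses more memory)
prtden 1       # option 2: avoid prtden<0 when the negative value is not needed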


I also found an inconsistency in your input file:
PAW is not compatible with vdw_xc=2 (only with vdw_xc=5,6,7).
The code did not warn about this (I have changed that now for the next version).
I don't know what the code is doing if you put vdw_xc=2 + PAW... apparently it doesn't crash, but...
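
If you do want a vdW correction together with PAW, the input would have to use one of the compatible flavors. A sketch only; which of the three values is physically appropriate for your system is a separate question:

Code: Select all

vdw_xc 5   # only vdw_xc = 5, 6 or 7 is compatible with PAW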

Regards,

Re: BUG MPI always crashes

Posted: Tue Aug 22, 2017 5:59 pm
by recohen
Thank you so much! So as I understand it, there is no way to do DF2 and DFT+U together, because I think DFT+U requires PAW. This kind of sinks what I wanted to do: I either have to drop the +U or use a different vdW functional, if I understand properly. Thank you so much for figuring this all out!

Ron

Re: BUG MPI always crashes

Posted: Wed Aug 23, 2017 12:26 pm
by recohen
Again, thanks for all the help! But I seem to be having the same or a similar problem after setting prtden to 1, switching to ONCV norm-conserving pseudopotentials, setting up a spin-polarized computation explicitly, and getting rid of the +U (these settings are sketched at the end of this post).

ITER STEP NUMBER 1
vtorho : nnsclo_now=2, note that nnsclo,dbl_nnsclo,istep=0 0 1
You should try to get npband*bandpp= 49
For information matrix size is 208468
You should try to get npband*bandpp= 49
For information matrix size is 208468

[0] WARNING: GLOBAL:DEADLOCK:NO_PROGRESS: warning
[0] WARNING: Processes have been blocked on average inside MPI for the last 5:00 minutes:
[0] WARNING: either the application has a load imbalance or a deadlock which is not detected
[0] WARNING: because at least one process polls for message completion instead of blocking
[0] WARNING: inside MPI.
[0] WARNING: [0] no progress observed for over 0:00 minutes, process is currently in MPI call:
[0] WARNING: mpi_alltoallv_(*sendbuf=0x2adeeb6267c0, *sendcounts=0x2ade6d07e8e0, *sdispls=0x2ade6d07e980, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2adeebfbe100, *recvcounts=0x2ade6d07e8c0, *rdispls=0x2ade6d07e920, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc400000e CART_SUB CART_CREATE CREATE COMM_WORLD [0:1], *ierr=0x7ffeb9769374)
[0] WARNING: m_xmpi_mp_xmpi_alltoallv_dp2d_ (/mnt/beegfs/bin/abinit)
[0] WARNING: [1] no progress observed for over 0:00 minutes, process is currently in MPI call:
[0] WARNING: mpi_alltoallv_(*sendbuf=0x2af022e26760, *sendcounts=0x2aefa42f6920, *sdispls=0x2aefa42f69c0, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2af02348d100, *recvcounts=0x2aefa42f6900, *rdispls=0x2aefa42f6960, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000007 CART_SUB CART_CREATE CREATE COMM_WORLD [0:1], *ierr=0x7ffe845494f4)
[0] WARNING: m_xmpi_mp_xmpi_alltoallv_dp2d_ (/mnt/beegfs/bin/abinit)
[0] WARNING: [2] through [28]: analogous reports for the remaining ranks (buffer addresses omitted). Most ranks are blocked in, or last called, mpi_alltoallv_ via m_xmpi_mp_xmpi_alltoallv_dp2d_ (/mnt/beegfs/bin/abinit) on pairwise CART_SUB sub-communicators; ranks 10, 11, 24 and 25 last called mpi_comm_size_, and ranks 26 and 27 last called mpi_wtime_. The captured output is cut off mid-line at rank 28.

This is with the following files file:
C20H16I4Fe2_AFM.in
C20H16I4Fe2_AFM.out
gs_i
gs_o
gs_g
/mnt/beegfs/rcohen/PSEUDO/Fe.psp8
/mnt/beegfs/rcohen/PSEUDO/Fe.psp8
/mnt/beegfs/rcohen/PSEUDO/C.psp8
/mnt/beegfs/rcohen/PSEUDO/H.psp8
/mnt/beegfs/rcohen/PSEUDO/I.psp8

and the attached input file
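
For reference, the changes described at the top of this post correspond to an input fragment like this (a hedged sketch: nsppol is the standard ABINIT variable for an explicitly spin-polarized run, the *.psp8 files above are the ONCV norm-conserving pseudopotentials, and which DFT+U variables appeared in the original input is an assumption):

Code: Select all

prtden 1   # positive value, avoiding the prtden<0 bug discussed above
nsppol 2   # explicitly spin-polarized computation
# DFT+U dropped: usepawu/lpawu/upawu/jpawu removed along with the PAW data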

Re: BUG MPI always crashes

Posted: Wed Aug 23, 2017 12:55 pm
by recohen
I recompiled 8.4.3 with the debug Intel MPI library and now it is running. I am not sure whether that was the fix, or whether something else was corrupted in the previous build. I simply set I_MPI_LINK=dbg_mt in the environment, then configured and ran make (see the sketch below). We will see.
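
Concretely, the rebuild was just the following (a sketch; the configure options are whatever was used for the original build):

Code: Select all

export I_MPI_LINK=dbg_mt   # link against the debug multithreaded Intel MPI libraries
./configure                # same options as the original build
make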

Re: BUG MPI always crashes

Posted: Tue Jan 02, 2018 5:49 pm
by recohen
Things are worse than I realized. After much work with abinit, it seems the vdw_xc 2 flag (or 1) does nothing! This is true even in the test case abinit-8.6.1/tests/vdwxc, where running without all the vdw flags in the example gives exactly the same result as running with them! It would be great if someone could look at this.
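
For anyone who wants to reproduce the check, it amounts to something like this (a sketch only: the actual input names under tests/vdwxc differ, and the file names here are illustrative):

Code: Select all

cd abinit-8.6.1/tests/vdwxc
abinit < case_with_vdw.files    > with_vdw.log     # input containing vdw_xc 2
abinit < case_without_vdw.files > without_vdw.log  # same input, vdw variables removed
grep etotal with_vdw.log without_vdw.log           # identical energies => flag ignored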