BUG abinit crashes for paral_kgb=1

recohen
Posts: 36
Joined: Tue Apr 30, 2013 10:48 pm

Post by recohen » Fri Nov 10, 2017 11:30 am

This job runs fine with k-point parallelization only, but crashes with every other parallelization scheme I have tried. I used
autoparal 1
paral_kgb 1
max_ncpus=128
to find recommended sets of parameters, and every one of them crashes with the latest version, 8.6.1, as well as with earlier versions.
This happens with the latest Intel compilers and MKL (2018) as well as with earlier Intel versions.

A sample traceback from:
autoparal 0
paral_kgb 1
#max_ncpus=128
npband 4
bandpp 1
npkpt 32
npfft 1
gives
getcut: wavevector= 0.0000 0.0000 0.0000 ngfft= 80 120 90
ecut(hartree)= 50.000 => boxcut(ratio)= 1.96423

ITER STEP NUMBER 1
vtorho : nnsclo_now=2, note that nnsclo,dbl_nnsclo,istep=0 0 1
You should try to get npband*bandpp= 96
For information matrix size is 58376

[48] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[48] ERROR: Fatal signal 11 (SIGSEGV) raised.
[48] ERROR: Signal was encountered at:
[48] ERROR: <<no file/line information found>>
[48] ERROR: After leaving:
[48] ERROR: mpi_alltoallv_(*sendbuf=0x2b3f5b191b00, *sendcounts=0x2b3f4ecc9aa0, *sdispls=0x2b3f4ecc9a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b3f5ab4d4c0, *recvcounts=0x2b3f4ecc9a20, *rdispls=0x2b3f4ecc9a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [48:51], *ierr=0x7ffc076ebbb0->MPI_SUCCESS)

[51] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[51] ERROR: Fatal signal 11 (SIGSEGV) raised.
[51] ERROR: Signal was encountered at:
[51] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[51] ERROR: <<1 stack level with no information>>
[51] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[51] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[51] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[51] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[51] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[51] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[51] ERROR: After leaving:
[51] ERROR: mpi_alltoallv_(*sendbuf=0x2b8630f75b00, *sendcounts=0x2b86247ddaa0, *sdispls=0x2b86247dda80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b8630932000, *recvcounts=0x2b86247dda20, *rdispls=0x2b86247dda40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [48:51], *ierr=0x7fffd4965bb0->MPI_SUCCESS)

[63] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[63] ERROR: Fatal signal 11 (SIGSEGV) raised.
[63] ERROR: Signal was encountered at:
[63] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[63] ERROR: <<1 stack level with no information>>
[63] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[63] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[63] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[63] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[63] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[63] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[63] ERROR: After leaving:
[63] ERROR: mpi_alltoallv_(*sendbuf=0x2ab417862100, *sendcounts=0x2ab40b555aa0, *sdispls=0x2ab40b555a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2ab417224100, *recvcounts=0x2ab40b555a20, *rdispls=0x2ab40b555a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [60:63], *ierr=0x7ffd685fbd30->MPI_SUCCESS)
[0] WARNING: starting premature shutdown

[112] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[112] ERROR: Fatal signal 11 (SIGSEGV) raised.
[112] ERROR: Signal was encountered at:
[112] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[112] ERROR: <<1 stack level with no information>>
[112] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[112] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[112] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[112] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[112] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[112] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[112] ERROR: After leaving:
[112] ERROR: mpi_alltoallv_(*sendbuf=0x2b2ab24f0100, *sendcounts=0x2b2aa5ee9aa0, *sdispls=0x2b2aa5ee9a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b2ab1eb2100, *recvcounts=0x2b2aa5ee9a20, *rdispls=0x2b2aa5ee9a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [112:115], *ierr=0x7ffca337fab0->MPI_SUCCESS)
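
With several ranks printing interleaved reports, it can help to group the frames by rank before comparing them. The following is only a rough sketch, assuming the "[rank] ERROR: ..." layout seen in the traceback above; the function name frames_by_rank is made up for illustration and is not part of ABINIT or of any MPI tool.

```python
import re
from collections import defaultdict

# Matches lines such as "[51] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)"
LINE_RE = re.compile(r"^\[(\d+)\] ERROR: (.*)$")


def frames_by_rank(log_text):
    """Collect, per MPI rank, the frames listed between
    'Signal was encountered at:' and 'After leaving:'.

    Hypothetical helper: it assumes the log layout shown in this thread.
    """
    frames = defaultdict(list)
    collecting = set()  # ranks whose frame list is currently open
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        rank = int(match.group(1))
        msg = match.group(2).strip()
        if msg.startswith("Signal was encountered at:"):
            collecting.add(rank)
            frames[rank]  # make sure the rank appears even with no frames
        elif msg.startswith("After leaving:"):
            collecting.discard(rank)
        elif rank in collecting:
            # Keep the symbol name, drop the trailing "(path)" part.
            frames[rank].append(msg.split(" (")[0])
    return dict(frames)
```

Applied to the report above, the ranks that give frame information all list the same chain through m_lobpcgwf_mp_getghc_gsc_, m_lobpcg2_mp_lobpcg_run_ and m_lobpcgwf_mp_lobpcgwf2_, which is easier to see once the interleaving is removed.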
Attachment: F10C20H10Fe2.in (3.27 KiB)
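
For anyone trying other manual grids, the settings in the post above (npkpt 32, npband 4, npfft 1, bandpp 1) can be sanity-checked before submitting a job. This is a minimal sketch, not ABINIT code: it assumes that with paral_kgb 1 the product npkpt*npband*npfft should equal the number of MPI processes, and that the number of bands should be divisible by npband*bandpp. The names KgbGrid and check_grid are hypothetical, and nband is left optional because the input file is not reproduced in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KgbGrid:
    """Manual process grid for paral_kgb 1 (illustrative container only)."""
    npkpt: int
    npband: int
    npfft: int
    bandpp: int = 1

    def nproc(self) -> int:
        # Number of MPI processes implied by the grid (spinor level ignored).
        return self.npkpt * self.npband * self.npfft


def check_grid(grid: KgbGrid, nproc_available: int,
               nband: Optional[int] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if grid.nproc() != nproc_available:
        problems.append(
            f"grid needs {grid.nproc()} processes, job has {nproc_available}")
    block = grid.npband * grid.bandpp
    if nband is not None and nband % block != 0:
        problems.append(
            f"nband={nband} is not a multiple of npband*bandpp={block}")
    return problems


if __name__ == "__main__":
    # Values taken from the post above, run on 128 processes.
    grid = KgbGrid(npkpt=32, npband=4, npfft=1, bandpp=1)
    print(grid.nproc(), check_grid(grid, 128))
```

The grid in the post passes the process-count check (32*4*1 = 128), so the crash does not look like a simple mismatch between the grid and the number of processes.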

Re: BUG abinit crashes for paral_kgb=1 *SOLVED*

Post by recohen » Sat Nov 11, 2017 6:48 pm

I recompiled everything without OpenMP and it works fine now!

Re: BUG abinit crashes for paral_kgb=1 (*Not Solved*)

Post by recohen » Sun Nov 12, 2017 10:15 am

I spoke too soon. The self-consistency cycle normally runs with paral_kgb 1 (although even that occasionally dies), but the Berry phase parts (datasets 3 and 4) always crash with allocation problems with paral_kgb 1. I tried many different numbers of processors, autoparal, and manual settings of the parallelization parameters.
It seems to be dying in m_fftcore/kpgsph, but it is hard to get a good traceback because memory is corrupted.
Example errors:
[1] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[1] ERROR: Fatal signal 11 (SIGSEGV) raised.
[1] ERROR: Signal was encountered at:
[1] ERROR: m_fftcore_mp_kpgsph_ (/mnt/beegfs/bin/abinit)
[1] ERROR: After leaving:
[1] ERROR: mpi_comm_rank_(comm=MPI_COMM_WORLD, *rank=0x7ffece8cf6c0->1, *ierr=0x7ffece8cf6c4->MPI_SUCCESS)

IO operation completed. cpu_time: 0.1 [s], walltime: 0.1 [s]
initberry: for direction 1, nkstr = 4, nstr = 16
initberry: for direction 2, nkstr = 4, nstr = 16
initberry: for direction 3, nkstr = 4, nstr = 16
*** Error in `/mnt/beegfs/bin/abinit': malloc(): smallbin double linked list corrupted: 0x000000000c553f10 ***
*** Error in `/mnt/beegfs/bin/abinit': malloc(): smallbin double linked list corrupted: 0x000000000d24bb60 ***


IO operation completed. cpu_time: 0.0 [s], walltime: 0.1 [s]
initberry: for direction 1, nkstr = 4, nstr = 16
initberry: for direction 2, nkstr = 4, nstr = 16
initberry: for direction 3, nkstr = 4, nstr = 16
Relative gap for number of plane waves between process (%): 0.24
Relative gap for number of plane waves between process (%): 0.26
Relative gap for number of plane waves between process (%): 0.16
Relative gap for number of plane waves between process (%): 0.16
*** Error in `/mnt/beegfs/bin/abinit': malloc(): memory corruption: 0x000000000d32cba0 ***
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(1552)...................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0xd2a91a0, count=9216, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1393).............: fail failed
MPIR_Allreduce_intra(975).............: fail failed
MPIDU_Complete_posted_with_error(1710): Process failed
MPIR_Allreduce_intra(1040)............: fail failed
MPIC_Sendrecv(576)....................: fail failed
MPIDI_CH3_EagerContigIsend(677).......: failure occurred while attempting to send an eager message
MPIDI_CH3_iSendv(37)..................: Communication error with rank 9
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(1552)...................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0xc37a900, count=9216, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1393).............: fail failed
MPIR_Allreduce_intra(975).............: fail failed
MPIDU_Complete_posted_with_error(1710): Process failed
MPIR_Allreduce_intra(1040)............: fail failed
MPIC_Sendrecv(576)....................: fail failed
:
