MPI_Abort error in Abinit 8.0.8  [SOLVED]

Total energy, geometry optimization, DFT+U, spin....

Moderator: bguster

Locked
sheng
Posts: 64
Joined: Fri Apr 11, 2014 3:44 pm

MPI_Abort error in Abinit 8.0.8

Post by sheng » Thu Jul 28, 2016 10:54 am

As the title says, I encounter an MPI_Abort error in one of my supercell calculations.
My Abinit 8.0.8 build is compiled with Intel MPI (ifort 15.0) and passed the whole test suite.
The supercell was generated with Phonopy for finite-difference calculations. The previous few supercells ran without any problem.

The errors are related to MPI_Abort, for example:

application called MPI_Abort(MPI_COMM_WORLD, 13) - process 63
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 68
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 71
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 62
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 66
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 67
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 60
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 65
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 69
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 70
application called MPI_Abort(MPI_COMM_WORLD, 13) - process 61
INTERNAL ERROR: invalid error code 78ea36 (Ring ids do not match) in MPIR_Allreduce_impl:1262
INTERNAL ERROR: invalid error code 58ea36 (Ring ids do not match) in MPIR_Allreduce_impl:1262
INTERNAL ERROR: invalid error code 58ea36 (Ring ids do not match) in MPIR_Allreduce_impl:1262
INTERNAL ERROR: invalid error code 68ea36 (Ring ids do not match) in MPIR_Allreduce_impl:1262
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0xf2a2ea0, rbuf=0xf6941c0, count=516706, MPI_DOUBLE_PRECISION, MPI_SUM, comm=0x84000004) failed
MPIR_Allreduce_impl(1262):

The errors only appear when I activate KGB parallelization with certain processor distributions. For example, this distribution causes the errors:

paral_kgb 1   npkpt 7 npband 12 npfft 1

but the following processor distribution runs without problems:

paral_kgb 1   npkpt 14 npband 4 npfft 1

All other parameters are still the same.
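
As a quick sanity check on the two distributions above: as far as I understand the ABINIT documentation, with paral_kgb 1 the product npkpt x npband x npfft (times npspinor, taken as 1 here) is expected to match the number of MPI processes. The sketch below only does that arithmetic; the function name is made up for illustration and is not part of Abinit.

```python
def kgb_process_count(npkpt, npband, npfft, npspinor=1):
    """Number of MPI processes implied by a paral_kgb distribution.

    Assumption: the process count equals the product of the np* variables.
    """
    return npkpt * npband * npfft * npspinor


# The two distributions quoted in this post.
failing = kgb_process_count(npkpt=7, npband=12, npfft=1)
working = kgb_process_count(npkpt=14, npband=4, npfft=1)
print("failing distribution:", failing, "processes")
print("working distribution:", working, "processes")
```

Under that assumption the two input lines imply different total process counts, so it may be worth confirming how many MPI processes each run was actually launched with.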

The log file and input file are attached.
Thank you.
Attachments
log.log (209.67 KiB)
BaTiO3.in.log (3.1 KiB)

Jordan
Posts: 282
Joined: Tue May 07, 2013 9:47 am

Re: MPI_Abort error in Abinit 8.0.8  [SOLVED]

Post by Jordan » Fri Jul 29, 2016 9:51 am

Hi,

This probably means you are hitting the usual issue with the LOBPCG algorithm, which is automatically activated with paral_kgb. Changing the processor distribution changes the way the algorithm works, and so can either avoid or trigger the error.
There is currently no fix available in the public version of Abinit, but hopefully this will change soon.
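
Since a different distribution can sidestep the error, one practical approach is to list the alternative splits for a given process count and try them in turn. Here is a minimal sketch, assuming npfft is kept fixed, that npkpt should not exceed the number of k-points, and that nband should be divisible by npband (my reading of the usual paral_kgb constraints, not something checked against this particular input). The helper name and the nkpt and nband values are placeholders, not taken from the attached files.

```python
def candidate_distributions(nproc, nkpt, nband, npfft=1):
    """Return (npkpt, npband, npfft) triples whose product is nproc."""
    if nproc % npfft != 0:
        return []
    remaining = nproc // npfft
    candidates = []
    for npkpt in range(1, remaining + 1):
        if remaining % npkpt != 0:
            continue
        npband = remaining // npkpt
        # Assumed constraints: npkpt <= nkpt and nband divisible by npband.
        if npkpt <= nkpt and nband % npband == 0:
            candidates.append((npkpt, npband, npfft))
    return candidates


# Placeholder values for illustration only.
for npkpt, npband, npfft in candidate_distributions(nproc=84, nkpt=14, nband=96):
    print(f"paral_kgb 1   npkpt {npkpt} npband {npband} npfft {npfft}")
```

Each printed line can be pasted into the input file in place of the failing distribution. Whether a given split avoids the error still has to be tested by running it.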

Cheers

Jordan

sheng
Posts: 64
Joined: Fri Apr 11, 2014 3:44 pm

Re: MPI_Abort error in Abinit 8.0.8

Post by sheng » Fri Jul 29, 2016 10:28 am

Thanks Jordan for the clarification.
