ABINIT 8.10.3 with GPU, MKL and MAGMA - segmentation fault
Posted: Fri Oct 18, 2019 2:03 pm
Hi,
I have compiled ABINIT 8.10.3 with GPU support enabled, using MKL + MAGMA for linear algebra. These are the settings in my config.ac:
enable_mpi="yes"
with_mpi_level="2"
with_mpi_prefix="$MPI_HOME"
enable_gpu="yes"
with_gpu_flavor="cuda-double"
with_gpu_incs="-I$CUDA_HOME/include/"
with_gpu_libs="-L$CUDA_HOME/lib64/ -lcublas -lcufft -lcudart -lstdc++"
with_gpu_cppflags="-DHAVE_GPU_MPI"
with_linalg_flavor="mkl+magma"
with_linalg_incs="-I$MKLROOT/include/intel64/lp64 -I$MKLROOT/include -I/home/nvydyanathan/Work/DMRL/ABINIT/new-install/magma-2.5.1/build/include"
with_linalg_libs="-L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -L/home/nvydyanathan/Work/DMRL/ABINIT/new-install/magma-2.5.1/build/lib -lmagma"
The loaded modules are:
Currently Loaded Modulefiles:
1) GCCcore/5.4.0
2) binutils/2.26-GCCcore-5.4.0
3) GCC/5.4.0-2.26
4) OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1
5) cuda/10.1.105
6) slurm/16.05.0
7) PrgEnv/GCC+OpenMPI/2018-05-24
8) gcc/7.3.0
9) hwloc/1.11.10
10) openmpi/2.1.3
11) mkl/2017-beta
Running make test_fast gives a segmentation fault. The backtrace in gdb is as follows:
ABINIT 8.10.3
Give name for formatted input file:
testin_fast.in
Give name for formatted output file:
testin_fast.out
Give root name for generic input files:
testin_fast_i
Give root name for generic output files:
testin_fast_o
Give root name for generic temporary files:
testin_fast_tmp
Program received signal SIGSEGV, Segmentation fault.
_gfortrani_next_record (dtp=dtp@entry=0x7fffffff7a00, done=done@entry=1) at ../../../libgfortran/io/transfer.c:3505
3505 ../../../libgfortran/io/transfer.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libibverbs-41mlnx1-OFED.4.3.2.1.6.43302.x86_64 libmlx4-41mlnx1-OFED.4.1.0.1.0.43302.x86_64 libmlx5-41mlnx1-OFED.4.3.2.0.0.43302.x86_64 libnl3-3.2.28-4.el7.x86_64 libpciaccess-0.14-1.el7.x86_64 librdmacm-41mlnx1-OFED.4.2.0.1.3.43302.x86_64 librxe-41mlnx1-OFED.4.1.0.1.7.43302.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64
(gdb) [dgx03:50755] 2 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[dgx03:50755] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
(gdb) bt
#0 _gfortrani_next_record (dtp=dtp@entry=0x7fffffff7a00, done=done@entry=1) at ../../../libgfortran/io/transfer.c:3505
#1 0x00002aaaaabba3f3 in finalize_transfer (dtp=dtp@entry=0x7fffffff7a00) at ../../../libgfortran/io/transfer.c:3616
#2 0x00002aaaaabba589 in _gfortran_st_write_done (dtp=0x7fffffff7a00) at ../../../libgfortran/io/transfer.c:3747
#3 0x000000000148a66e in m_errors::msg_hndl (
message=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>,
level=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>,
mode_paral=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>,
file=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>, line=<optimized out>,
nodump=<optimized out>, nostop=<optimized out>, _message=<optimized out>, _level=<optimized out>, _mode_paral=<optimized out>,
_file=<optimized out>) at m_errors.F90:901
Could you please help me resolve this?
Thanks,
Naga