About "signal 11 (Segmentation fault)" error when value of n

Post by anemonekgo » Tue Mar 01, 2016 2:26 am

Subject: "signal 11 (Segmentation fault)" error when the value of natom is made too large.

Dear all,

Thank you, as always, for your help.
I am building a parallel computer myself; its details (the diskless setup and the abinit configure information) are given at the bottom of this post.
Thanks to your advice it is about 80% complete. However, I am stuck on the small part that remains.
Please help me again.

During test runs on the parallel computer, I found that an error occurs when natom is made large.


The first problem was a LAN traffic error. After I changed the LAN card and switch to 10Gbps for the master node only, I was able to calculate without a problem.
With this change, calculations were possible up to natom=396.
=================================================================================================================
Run    natom  Result
Run_1     48  Calculation ends without problems on the 1Gbps LAN system
Run_2    156  Calculation ends without problems after the LAN card and switch were changed to 10Gbps for the master node only
Run_3    252  Same as above
Run_4    300  Same as above
Run_5    348  Same as above
Run_6    396  Same as above
Run_7    444  Error appears at $filename_o_DS1_TIM6.cif(DEN) and the run stops
Run_8    492  Error appears at $filename_o_DS1_TIM4.cif(DEN) and the run stops
=================================================================================================================
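
The results in the table above can be bracketed in a few lines of Python, to make explicit between which natom values the failure starts. This is only a bookkeeping sketch; the names `runs` and `bracket_failure` are illustrative and are not part of ABINIT.

```python
# Test-run results copied from the table above: (natom, completed?)
runs = [
    (48, True),
    (156, True),
    (252, True),
    (300, True),
    (348, True),
    (396, True),
    (444, False),
    (492, False),
]


def bracket_failure(results):
    """Return (largest natom that completed, smallest natom that failed)."""
    ok = [natom for natom, done in results if done]
    bad = [natom for natom, done in results if not done]
    return (max(ok) if ok else None, min(bad) if bad else None)


last_ok, first_bad = bracket_failure(runs)
print(f"largest natom that completed: {last_ok}")
print(f"smallest natom that failed:   {first_bad}")
```

A run with natom between these two values would narrow down where the failure begins.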

However, when natom is made larger than that, the following error occurs and the calculation stops.
#log
================================================================================
----iterations are completed or convergence reached----

outwf: write wavefunction to file Run_8_o_DS1_WFK, with accesswff 1
--------------------------------------------------------------------------
mpirun noticed that process rank 28 with PID 0 on node 192.168.0.107 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

#terminal
================================================================================
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7fca9012e642
.
.
#13 0x2AE20DCBC2C5
#14 0x2AE20DCBC338
#15 0x2AE1FB0A1CCD
#16 0x2AE1F9283EB5
#17 0xEE399A in __m_wffile_MOD_wffreadwrite_mpio at m_wffile.F90:2054
#18 0xE3780B in writewf_ at rwwf.F90:935
#19 0xE38F0F in rwwf_ at rwwf.F90:163
#20 0x9BEF46 in outwf_ at outwf.F90:454
#21 0x4B1912 in gstate_ at gstate.F90:1266
#22 0x434857 in gstateimg_ at gstateimg.F90:426
#23 0x425BE4 in driver_ at driver.F90:645 (discriminator 2)
#24 0x41CCBA in abinit at abinit.F90:460
[node04][[58377,1],18][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node01][[58377,1],6][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node02][[58377,1],9][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
---------------------------------------------------------------------------------------------------------------------------
[node00:17529] 40 more processes have sent help message help-mpi-api.txt / mpi-abort
[node00:17529] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
---------------------------------------------------------------------------------------------------------------------------
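
To compare several failed runs, it can help to pull the crashing rank, the node, and the symbolized backtrace frames out of the output automatically. The following is a minimal sketch that uses only the Python standard library; the regular expressions were written against the lines shown above and are an assumption about the message format, not a documented Open MPI or gfortran interface.

```python
import re

# Sample lines copied from the log and terminal output above.
LOG = """\
mpirun noticed that process rank 28 with PID 0 on node 192.168.0.107 exited on signal 11 (Segmentation fault).
#17 0xEE399A in __m_wffile_MOD_wffreadwrite_mpio at m_wffile.F90:2054
#18 0xE3780B in writewf_ at rwwf.F90:935
#19 0xE38F0F in rwwf_ at rwwf.F90:163
#20 0x9BEF46 in outwf_ at outwf.F90:454
"""

# Matches the mpirun summary line.
CRASH_RE = re.compile(
    r"process rank (?P<rank>\d+) with PID (?P<pid>\d+) on node (?P<node>\S+) "
    r"exited on signal (?P<signal>\d+)"
)
# Matches backtrace frames that carry a function name and a source location.
FRAME_RE = re.compile(
    r"^#(?P<depth>\d+)\s+0x[0-9A-Fa-f]+ in (?P<func>\S+) "
    r"at (?P<file>[^:\s]+):(?P<line>\d+)"
)


def parse_log(text):
    """Return (crashes, frames) found in the given log text."""
    crashes, frames = [], []
    for line in text.splitlines():
        m = CRASH_RE.search(line)
        if m:
            crashes.append((int(m["rank"]), m["node"], int(m["signal"])))
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m["depth"]), m["func"], m["file"], int(m["line"])))
    return crashes, frames


crashes, frames = parse_log(LOG)
for rank, node, sig in crashes:
    print(f"rank {rank} on {node} exited on signal {sig}")
if frames:
    # The lowest-numbered symbolized frame is the innermost named routine.
    depth, func, src, lineno = min(frames)
    print(f"innermost symbolized frame: #{depth} {func} ({src}:{lineno})")
```

In the backtrace above, the innermost named frame is the MPI-IO write routine in m_wffile.F90, reached from outwf while the wavefunction file is being written.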

I do not understand it well, but the most prominent error was "signal 11 (Segmentation fault)".

Therefore I tried the following actions.

*1st. I tried removing the stack size limit for Run_7.

$ ulimit -s unlimited

However, the result was the same.
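
`ulimit -s unlimited` applies to the shell in which it is typed and to the processes started from that shell. One way to confirm which limit a process actually sees is to query it from inside the process. The sketch below uses Python's standard `resource` module (Unix only); the helper names are illustrative. Whether the MPI ranks on the client nodes inherit the setting depends on how they are launched, and that has not been checked for this cluster.

```python
import resource


def stack_limit():
    """Return the (soft, hard) stack size limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return soft, hard


def describe(limit):
    """Format a limit given in bytes, or 'unlimited' for RLIM_INFINITY."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} kB"


soft, hard = stack_limit()
print(f"stack soft limit: {describe(soft)}, hard limit: {describe(hard)}")
```

If this script is started from the same shell right after `ulimit -s unlimited` succeeded, the soft limit should be reported as unlimited.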

*2nd. I tried changing the value of ecut for Run_7.

ecut=5, pawecutdg=20 : error appears at $filename_o_DS1_TIM2.cif(DEN) and the run stops
ecut=10, pawecutdg=20 : error appears at $filename_o_DS1_TIM6.cif(DEN) and the run stops
ecut=15, pawecutdg=30 : error appears at $filename_o_DS1_TIM8.cif(DEN) and the run stops
ecut=20, pawecutdg=40 : still running; the calculation is slow and no result is available yet.


Since removing the stack size limit changed nothing, I intended to lower ecut further in order to reduce the memory requirement.
However, the opposite result was obtained: the lower the ecut, the earlier the run stopped.
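
The trend in the three finished ecut trials can be checked mechanically. This is a small sketch with illustrative names (`trials`, `is_increasing`); the numbers are the ecut, pawecutdg, and TIM index values listed above.

```python
# (ecut, pawecutdg, TIM index of the DEN file at which the error appeared)
trials = [
    (5, 20, 2),
    (10, 20, 6),
    (15, 30, 8),
]


def is_increasing(values):
    """True if every value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Sort by ecut, then read off the step at which each trial stopped.
trials_sorted = sorted(trials)
steps = [tim for _, _, tim in trials_sorted]

for ecut, pawecutdg, tim in trials_sorted:
    print(f"ecut={ecut:>2} pawecutdg={pawecutdg} -> error at TIM{tim}")
print("higher ecut -> later stop:", is_increasing(steps))
```

For these three trials the stopping step grows with ecut, which is the opposite of what a simple shortage of memory would suggest.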

I am at a loss because I do not know what is happening.

I would be grateful for any advice.

Please help me again.

Best regards,
Haruyuki Satou
(anemonekgo)
=====================================================================================================================
#Please excuse the rough parameters; for this test run I gave priority to confirming that the system works.

<Run_X.in>
------------------------------------------------
autoparal 1
ndtset 1

optcell 2
ionmov 3
tolmxf 5.0d-4
ntime 10
#getcell -1
#getxred -1
dilatmx 1.05
ecutsm 0.5

kptopt 3
nshiftk 1
shiftk 0.0 0.0 0.0
ngkpt 2 2 2
occopt 1

natom "Variable"
ntypat 5

ecut 10.0 #def.15
pawecutdg 20.0
pawspnorb 0
pawovlp 20
ixc 11

nstep 40
toldff 1.0d-6
diemac 12.0
optforces 1
prtcif 1
prtden 1
enunit 1
------------------------------------------------

<Run_X.files>
------------------------------------------------
Run_X.in
Run_X.out
Run_X_i
Run_X_o
Run_X0
atom1_pbe_v1_abinit.paw
atom2_pbe_v1_abinit.paw
atom3_pbe_v1.2_abinit.paw
atom4_pbe_v1.2_abinit.paw
atom5_pbe_v1_abinit.paw
------------------------------------------------

============================================================================================
additional information
~~~~~~~~~~~~~~~~~~~~~~~~~~~

===Diskless(PXEboot) cluster system=========================================================

abinit-7.10.4/OS:ubuntu14.04 amd64
-master node :[core i7 4790k(4core)/RAM 16Gb/ssd 512Gb/ASUS H97Plus/intel 1Gbps NIC]x1 # =>intel 1Gbps NIC
-clients node :[core i7 4790k(4core)/RAM 16Gb/ASUS H97Plus]x15
-Switching hub:NETGEAR 1Gbps #=> NETGEAR 10Gbps x2 + 1Gbps x24

server setting:tftpd-hpa syslinux nfs-kernel-server isc-dhcp-server openssh-server NIS-server

=============================================================================================

abinit-7.10.4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
abinit configure/ atlas gcc48
#configure information
$ gedit ubuntu.ac
-----------------------------------------------------------------------------------------------------------------------
prefix="/usr/local"
enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_prefix="/usr/local/openmpi"
with_trio_flavor="netcdf+etsf_io+fox"
with_fft_flavor="fftw3"
with_fft_incs="-I/usr/local/fftw/3.3.4/include"
with_fft_libs="-L/usr/local/fftw/3.3.4/lib -lfftw3 -lfftw3_mpi -lfftw3_threads -lfftw3f -lfftw3f_mpi -lfftw3f_threads"
with_linalg_flavor="atlas"
with_linalg_libs="-L/usr/lib -llapack -lf77blas -lcblas -latlas -lblas"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"
enable_gw_dpc="yes"
#enable_openmp="yes"
-----------------------------------------------------------------------------------------------------------------------
$ ./configure --with-config-file="./ubuntu.ac" fcflags_opt_67_common="-O2" fcflags_opt_77_lwf="-O2" fcflags_opt_wannier90="-O2" -enable-optim="safe"

==============================================================================
=== Final remarks ===
==============================================================================


Summary of important options:

* C compiler : gnu version 4.8
* Fortran compiler: gnu version 4.8
* architecture : unknown unknown (64 bits)

* debugging : basic
* optimizations : safe

* OpenMP enabled : no (collapse: ignored)
* MPI enabled : yes
* MPI-IO enabled : yes
* GPU enabled : no (flavor: none)

* TRIO flavor = netcdf-fallback+etsf_io-fallback+fox-fallback
* TIMER flavor = abinit (libs: ignored)
* LINALG flavor = atlas (libs: user-defined)
* ALGO flavor = none (libs: ignored)
* FFT flavor = fftw3 (libs: user-defined)
* MATH flavor = none (libs: ignored)
* DFT flavor = libxc-fallback+atompaw-fallback+bigdft-fallback+wannier90-fallback

Configuration complete.
You may now type "make" to build ABINIT.
(or, on a SMP machine, "make mj4", or "make multi multi_nprocs=<n>")


Re: About "signal 11 (Segmentation fault)" error when value

Post by anemonekgo » Wed Mar 02, 2016 3:27 am

Dear all,
I am correcting an error in the hardware description of my previous post.
Best,
============================================================================================
additional information
~~~~~~~~~~~~~~~~~~~~~~~~~~~
===Diskless(PXEboot) cluster system=========================================================

abinit-7.10.4/OS:ubuntu14.04 amd64
-master node :[core i7 4790k(4core/8T)/RAM 32Gb/ssd 512Gb/ASUS H97Plus/intel 1Gbps NIC]x1 # =>intel 10Gbps NIC
-clients node :[core i7 4790k(4core/8T)/RAM 16Gb/ASUS H97Plus]x15
-Switching hub:NETGEAR 1Gbps x16 #=> NETGEAR 10Gbps x2 + 1Gbps x24

server setting:tftpd-hpa syslinux nfs-kernel-server isc-dhcp-server openssh-server NIS-server
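
For reference, the totals implied by the corrected hardware list can be worked out as follows. This is only an arithmetic sketch: it assumes that "Gb" in the list means gigabytes of RAM and that "8T" means 8 hardware threads, and the names `nodes` and `totals` are illustrative.

```python
# (label, node count, cores per node, threads per node, RAM per node in GB)
# Values are taken from the corrected list above; "Gb" is assumed to mean GB.
nodes = [
    ("master", 1, 4, 8, 32),
    ("client", 15, 4, 8, 16),
]


def totals(node_list):
    """Return (total cores, total threads, total RAM in GB) of the cluster."""
    cores = sum(count * c for _, count, c, _, _ in node_list)
    threads = sum(count * t for _, count, _, t, _ in node_list)
    ram = sum(count * r for _, count, _, _, r in node_list)
    return cores, threads, ram


total_cores, total_threads, total_ram = totals(nodes)
print(f"cores: {total_cores}, threads: {total_threads}, RAM: {total_ram} GB")

# Memory available per physical core on each kind of node.
for label, _, cores, _, ram in nodes:
    print(f"{label}: {ram / cores:.0f} GB per core")
```

The per-core figure on the client nodes is the relevant bound when one MPI process runs per physical core.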
