parallel_run error of abinit 3  [SOLVED]

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

parallel_run error of abinit 3

Post by anemonekgo » Mon Jan 11, 2016 10:01 pm

Dear all,

Thank you, as always, for your help.
I have been working on running abinit in parallel since the end of last year.
The advice I received in this forum on two earlier occasions has brought me to the current stage.

My previous question was about the "libraries" errors that appeared when starting abinit in parallel.
Thanks to your advice, I was able to get rid of those errors.

# The earlier exchange was long, so I am linking to it instead of quoting it:
==New questions; parallel_run error of abinit < viewtopic.php?f=2&t=3171 >

Once that was resolved, the following new error message appeared.

$ mpirun -np 8 -machinefile /home/appwg-3/my_hosts abinit < BaTiO3.files > BaTiO3.log
root@node00:/home/appwg-3/BaTiO3_n8_cluster# mpirun -np 8 -machinefile /home/appwg-3/my_hosts abinit < BaTiO3.files > BaTiO3.log

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0 0x7f1e27e2f642
#1 0x7f1e27e2fd7e
#2 0x7f1e2756bd3f
#3 0x11e4a4f
#4 0xdb6a3c
#5 0xdac65e
#6 0xd7e8e1
#7 0xd24655
#8 0x478548
#9 0x47814c
#0 0x7fada22d5642
#1 0x7fada22d5d7e
#2 0x7fada1a11d3f
#3 0x11e4a4f
#4 0xdb6a3c
#5 0xdac65e
#6 0xd7e8e1
#7 0xd24655
#8 0x478548
#9 0x47814c
#10 0x7fada19fcec4
#11 0x47817c
#12 0xffffffffffffffff
#10 0x7f1e27556ec4
#11 0x47817c
#12 0xffffffffffffffff

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0 0x7f7a4b4bc642
#1 0x7f7a4b4bcd7e
#2 0x7f7a4abf8d3f
#3 0x11e4a4f
#4 0xdb6a3c
#5 0xdac65e
#6 0xd7e8e1
#7 0xd24655
#8 0x478548
#9 0x47814c
#10 0x7f7a4abe3ec4
#11 0x47817c
#12 0xffffffffffffffff

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0 0x7f7bb9b29642
#1 0x7f7bb9b29d7e
#2 0x7f7bb9265d3f
#3 0x11e4a4f
#4 0xdb6a3c
#5 0xdac65e
#6 0xd7e8e1
#7 0xd24655
#8 0x478548
#9 0x47814c
#10 0x7f7bb9250ec4
#11 0x47817c
#12 0xffffffffffffffff
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 3539 on node node01 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------


*Please refer to the attached log.
------------------------------------------
#my_hosts
192.168.0.100 node00 slots=4 max-slots=4
192.168.0.101 node01 slots=4 max-slots=4
------------------------------------------

At this point I do not know what the cause is.

Could you advise me on what might be causing this?

Thank you again for your help.

Best regards,
anemonekgo
(Haruyuki Satou)
Attachments: 2_BaTiO3.log (7.93 KiB), BaTiO3.in (1016 Bytes)

pouillon
Posts: 651
Joined: Wed Aug 19, 2009 10:08 am
Location: Spain

Re: parallel_run error of abinit 3

Post by pouillon » Wed Jan 13, 2016 1:05 pm

This happens because the node on which you're running Abinit has a different architecture than the one on which you built it.

Rebuilding Abinit with optimizations compatible with the run-time architecture should solve your problem.
Yann Pouillon
Simune Atomistics
Donostia-San Sebastián, Spain
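
On x86 Linux nodes, one way to look for such a mismatch is to compare the CPU feature flags of the build host with those of each run host. The sketch below is only an illustration: the helper names `cpu_flags` and `missing_flags` and the file names are hypothetical, and it assumes that /proc/cpuinfo contains a `flags` line, as it does on x86 Linux.

```shell
# Read cpuinfo-style text on stdin and print one CPU feature flag per line, sorted.
cpu_flags() {
    awk -F: '/^flags/ { print $2; exit }' | tr -s ' ' '\n' | sed '/^$/d' | sort -u
}

# Print the flags listed in the first file (build host) that are absent
# from the second file (run host). Both files must come from cpu_flags.
missing_flags() {
    comm -23 "$1" "$2"
}

# Example (host and file names are placeholders):
#   cpu_flags < /proc/cpuinfo > build.flags
#   ssh node01 cat /proc/cpuinfo | cpu_flags > node01.flags
#   missing_flags build.flags node01.flags
```

Any output from `missing_flags` means the build host supports instructions that the run host lacks. Whether the compiler actually emitted them depends on the optimization flags used for the build.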

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Re: parallel_run error of abinit 3

Post by anemonekgo » Thu Jan 14, 2016 11:12 pm

Dear Pouillon,

Thank you for your prompt reply.
I am now going to study the run-time architecture and try the approach you suggest.
Once I have a result, I will report it here.

With many thanks,
anemonekgo
(Haruyuki Satou)

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Re: parallel_run error of abinit 3  [SOLVED]

Post by anemonekgo » Mon Jan 25, 2016 7:16 am

Dear Pouillon,
CC:Dear all,


Thank you, as always, for your help.

Your advice was very helpful.

After trying the following optimization settings based on your advice, all 64 CPUs were able to run.

@Optimization settings tried in the abinit configure options

0) ./configure --with-config-file="./ubuntu.ac"                 #base#
=> Terminated with the error above.

1) #base# + fcflags_opt_67_common="-O2" fcflags_opt_77_lwf="-O2" fcflags_opt_wannier90="-O2"
=> Terminated with the same error.

2) #base# + fcflags_opt_67_common="-O2" fcflags_opt_77_lwf="-O2" fcflags_opt_wannier90="-O2" --enable-optim="no" --enable-debug="enhanced"
=> All the CPUs ran simultaneously, but the calculation was slow (about 2x slower).

3) #base# + fcflags_opt_67_common="-O2" fcflags_opt_77_lwf="-O2" fcflags_opt_wannier90="-O2" --enable-debug="enhanced"
=> All the CPUs ran simultaneously, but the calculation was slow (about 2x slower).

4) #base# + fcflags_opt_67_common="-O2" fcflags_opt_77_lwf="-O2" fcflags_opt_wannier90="-O2" --enable-optim="no"
=> All the CPUs ran simultaneously, but the calculation was slow (about 2x slower).

5) #base# + fcflags_opt_67_common="-O2" fcflags_opt_77_lwf="-O2" fcflags_opt_wannier90="-O2" --enable-optim="safe"
=> All the CPUs ran simultaneously, and the run finished at normal speed.
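
The thread does not explain why only some of these settings avoid the crash. One possibility, stated here purely as an assumption, is that the working builds no longer use compiler options tied to the build host's CPU. With GCC/gfortran, an option such as `-march=native` lets the compiler use every instruction set of the build machine, which can end in SIGILL on a CPU that lacks them. The hypothetical helper below simply filters a flag string for options of that kind; the pattern list is illustrative, not exhaustive.

```shell
# Print each flag in the given string that ties the binary to a specific CPU.
# -mtune=... is deliberately not reported: it affects scheduling only,
# not which instructions may be emitted.
host_specific_flags() {
    for flag in $1; do
        case "$flag" in
            -march=*|-mavx*|-msse*|-mfma*) printf '%s\n' "$flag" ;;
        esac
    done
}

# Example: host_specific_flags "$FCFLAGS"
```

An empty result does not prove the build is portable, since linked libraries (for example a BLAS tuned on the build host) may carry their own CPU-specific code.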


Thanks to you, I was able to run on 16 nodes (x 4 CPUs each).

However, the calculation occasionally stops with the following error.

The mpirun command that I use:

$ mpirun --allow-run-as-root -np 64 -machinefile my_hosts abinit < BaTiO3.files > BaTiO3.log

It looks like a communication fault on one node, but the hostname (IP address) reported is not the same each time.

=======================================================================================================

ITER STEP NUMBER 2
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 2
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

hostname: 192.168.0.101

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------
=======================================================================================================

I am currently investigating the cause.

I would appreciate any advice you can give me.

Best regards,
anemonekgo
(Satou Haruyuki)
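
A basic first check for this kind of ORTE message is whether every node in the hostfile is reachable at all. The following is a minimal sketch, assuming a Linux `ping` that accepts `-c` and `-W` and a hostfile whose first field is the address, as in my_hosts; `list_hosts` and `probe_hosts` are hypothetical helper names.

```shell
# Print the first field (address or hostname) of each hostfile entry,
# ignoring comments and blank lines.
list_hosts() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1"
}

# Ping every host three times and report the ones that do not answer.
probe_hosts() {
    for host in $(list_hosts "$1"); do
        if ping -c 3 -W 2 "$host" > /dev/null 2>&1; then
            echo "ok          $host"
        else
            echo "unreachable $host"
        fi
    done
}

# Example: probe_hosts my_hosts
```

A node that answers an idle ping can still drop connections under heavy MPI traffic, so a clean result here narrows the search but does not rule out the network.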


=====================================================================================================================
@Remarks

#ubuntu.ac
--------------------------------------------------------------------------------------------------------------------
prefix="/usr/local"
enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_prefix="/usr/local/openmpi"
#with_trio_flavor="netcdf+etsf_io"
with_trio_flavor="netcdf+etsf_io+fox"
#with_netcdf_incs="-I/usr/local/netcdf-4.3.3/gcc49/include"
#with_netcdf_libs="-L/usr/local/netcdf-4.3.3/gcc49/lib -lnetcdf -lnetcdff"
with_fft_flavor="fftw3"
with_fft_incs="-I/usr/local/fftw/3.3.4/include"
with_fft_libs="-L/usr/local/fftw/3.3.4/lib -lfftw3 -lfftw3_mpi -lfftw3_threads -lfftw3f -lfftw3f_mpi -lfftw3f_threads"
with_linalg_flavor="atlas"
with_linalg_libs="-L/usr/lib -llapack -lf77blas -lcblas -latlas"
#with_linalg_flavor="netlib"
#with_linalg_libs="-L/usr/local/lib -llapack -lblas"
#with_linalg_flavor="netlib+scalapack"
#with_linalg_libs="-L/usr/local/lib -lscalapack -llapack -lblas"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"
#with_dft_flavor="atompaw+libxc"
enable_gw_dpc="yes"
#enable_openmp="yes"
-----------------------------------------------------------------------------------------------------------------------
#my_hosts
-------------------------------------------------
192.168.0.100 node00 slots=4 #master node
192.168.0.101 node01 slots=4
192.168.0.102 node02 slots=4
192.168.0.103 node03 slots=4
192.168.0.104 node04 slots=4
192.168.0.105 node05 slots=4
192.168.0.106 node06 slots=4
192.168.0.107 node07 slots=4
192.168.0.108 node08 slots=4
192.168.0.109 node09 slots=4
192.168.0.110 node10 slots=4
192.168.0.111 node11 slots=4
192.168.0.112 node12 slots=4
192.168.0.113 node13 slots=4
192.168.0.114 node14 slots=4
192.168.0.115 node15 slots=4
----------------------------------------------------
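
With this hostfile, -np 64 matches the 16 x 4 = 64 declared slots. When nodes are added or removed, the total can be recomputed with a small helper such as the one sketched below; `total_slots` is a hypothetical name, and the sketch assumes the `slots=N` notation used above.

```shell
# Sum the slots=N values of a hostfile and print the total.
total_slots() {
    awk '
        { sub(/#.*/, "") }                      # strip comments
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) {    # max-slots=N does not match
                    split($i, kv, "=")
                    n += kv[2]
                }
        }
        END { print n + 0 }
    ' "$1"
}

# Example: warn when -np would exceed the declared slots.
#   [ "$(total_slots my_hosts)" -ge 64 ] || echo "not enough slots for -np 64"
```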

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Re: parallel_run error of abinit 3

Post by anemonekgo » Thu Feb 04, 2016 1:56 am

Dear all,

Thank you, as always, for your help.
The other day I described the errors on my cluster.
Since then I have found the cause of the network and inter-node communication errors that I was investigating.

I am reporting it here as a memo.

Put simply, my cluster was crashing because of traffic congestion on the LAN.

The error occurs for structures with larger "natom".

As a countermeasure, I changed the network interfaces:
------------------------------------------------------------------------------------------------------------
Component            Before       After                    Notes
Master node00 NIC    1Gbps        10Gbps                   Intel card
LAN switching hub    1Gbps x16    10Gbps x2 + 1Gbps x24    NETGEAR
Clients node01-15    1Gbps x15    unchanged                ASUS onboard LAN
------------------------------------------------------------------------------------------------------------
I have confirmed that runs with up to "natom ~300" work in the current setup with "np = 64".

PS
Ideally I would change all the network interfaces to 10 Gbps, but that is very expensive.
So for now I will see how things go with the present setup.
In addition, I increased communication-related kernel settings such as "net.core.somaxconn".
However, although those settings seemed to improve things, they did not fully solve the problem.

Best regards,
anemonekgo
(Haruyuki Satou)
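
For reference, on Linux the current value of a kernel setting such as net.core.somaxconn can be read from procfs. The helper below is a sketch with a hypothetical name, `read_sysctl`; the thread does not state which values were used, so none are suggested here.

```shell
# Print the current value of a sysctl key by reading it from procfs.
# The optional second argument overrides the procfs root directory.
read_sysctl() {
    key_path=$(printf '%s' "$1" | tr '.' '/')
    cat "${2:-/proc/sys}/$key_path"
}

# Example: read_sysctl net.core.somaxconn
```

As root, `sysctl -w net.core.somaxconn=<value>` changes the setting until the next reboot.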
