Parallel run error

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Parallel run error

Post by anemonekgo » Mon Dec 21, 2015 5:09 am

I'm always thankful for the help.
I am now building a diskless parallel cluster myself (details at the bottom of this post),
but it is not going well. Please help me again.

When I run abinit with the commands below, an error occurs as soon as two or more nodes are used.

$ mpirun -n 4 abinit < BaTiO3.files > BaTiO3.log # Single-node run
=> no problem; the calculation completes (about 3 min).

$ mpirun -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log # Run on two or more nodes
=> the following error message appears

#node_3
node00 cpu=4
node01 cpu=4
node02 cpu=4

=error message===================================================================================================
root@anemonekgo:/home/$user/BaTiO3_n12_mf# mpirun -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log
The authenticity of host 'node01 (192.168.0.101)' can't be established.
ECDSA key fingerprint is a6:bb:67:d8:5a:55:b0:47:03:57:bd:69:c7:6a:7f:04.
Are you sure you want to continue connecting (yes/no)? bash: orted: command not found # The prompt vanishes immediately, so I cannot type yes.
--------------------------------------------------------------------------
A daemon (pid 24085) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
===========================================================================================

a) By the way, I can ssh between the nodes without a password.

b) I exported the following path, but got the same error.

$ export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
$ echo $LD_LIBRARY_PATH
/usr/local/openmpi/lib:/lib:
$ mpirun -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log
=>same error

I would appreciate it if you would give me some advice.

Best regards,

anemonekgo
(Haruyuki Satou)




============================================================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~
additional information
~~~~~~~~~~~~~~~~~~~~~~~~~~~

=Diskless(PXEboot) cluster system===========================================================
abinit-7.10.4/OS:ubuntu14.04 amd64
-master node :[core i7 4790k(4core)/RAM 16Gb/ASUS H97Plus]x15
-clients node :[core i7 4790k(4core)/RAM 16Gb/ssd 512Gb/ASUS H97Plus/intel NIC]x1

server setting: tftpd-hpa syslinux nfs-kernel-server isc-dhcp-server openssh-server NIS-server

=============================================================================================



***Compiling abinit-7.10.4 **********************************************************
I compiled it following this recipe from the abinit forum.

~Recipe to compile abinit 7.8.2 on UBUNTU 14.04 (64bits) ~
viewtopic.php?f=2&t=2807

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#configure information
$ gedit ubuntu.ac
-----------------------------------------------------------------------------
prefix="/usr/local"
enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_prefix="/usr/local/openmpi"
with_trio_flavor="netcdf+etsf_io"
#with_netcdf_incs="-I/usr/include"
#with_netcdf_libs="-L/usr/lib -lnetcdf -lnetcdff"
with_fft_flavor="fftw3"
with_fft_incs="-I/usr/local/fftw/3.3.4/include"
with_fft_libs="-L/usr/local/fftw/3.3.4/lib -lfftw3 -lfftw3f"
with_linalg_flavor="atlas"
with_linalg_libs="-L/usr/lib -llapack -lf77blas -lcblas -latlas"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"
#with_dft_flavor="atompaw+libxc"
enable_gw_dpc="yes"
#enable_openmp="yes"
-----------------------------------------------------------------------------
$ ./configure --with-config-file="./ubuntu.ac"

==============================================================================
=== Final remarks ===
==============================================================================
Summary of important options:

* C compiler : gnu version 4.8
* Fortran compiler: gnu version 4.8
* architecture : unknown unknown (64 bits)

* debugging : basic
* optimizations : standard

* OpenMP enabled : no (collapse: ignored)
* MPI enabled : yes
* MPI-IO enabled : yes
* GPU enabled : no (flavor: none)

* TRIO flavor = netcdf+etsf_io-fallback
* TIMER flavor = abinit (libs: ignored)
* LINALG flavor = atlas (libs: user-defined)
* ALGO flavor = none (libs: ignored)
* FFT flavor = fftw3 (libs: user-defined)
* MATH flavor = none (libs: ignored)
* DFT flavor = libxc-fallback+atompaw-fallback+bigdft-fallback+wannier90-fallback

Configuration complete.
You may now type "make" to build ABINIT.
(or, on a SMP machine, "make mj4", or "make multi multi_nprocs=<n>")
--------------------------------------------------------------------------
$ make mj4
$ make install

$ cd tests
$ ./runtests.py -j 4 fast

Regenerating database...
Saving database to file /home/appwg-3/abinit-7.10.4/tests/test_suite.cpkl
Running ntests = 26, MPI_nprocs = 1, py_nthreads = 4...
[fast][t01][np=1]: succeeded
[fast][t29][np=1]: succeeded
Test suite completed in 7.51 s (average time for test = 1.65 s, stdev = 1.72 s)
failed: 0, succeeded: 10, passed: 1, skipped: 0, disabled: 0

$ ./runtests.py paral -n 2 -j 2
Test_suite directory already exists! Old files will be removed
Running ntests = 98, MPI_nprocs = 2, py_nthreads = 2...
Test suite completed in 77.42 s (average time for test = 1.63 s, stdev = 4.30 s)
failed: 0, succeeded: 14, passed: 6, skipped: 70, disabled: 0

==============================================
@cluster_tests structure:BaTiO3
~ecut=15
~ngkpt=8x8x8

attachment:BaTiO3.in

pouillon
Posts: 651
Joined: Wed Aug 19, 2009 10:08 am
Location: Spain

Re: Parallel run error

Post by pouillon » Mon Dec 21, 2015 1:36 pm

This problem is not related to Abinit, but to your MPI installation. Please consult the corresponding MPI documentation and/or contact the developers of the MPI implementation you're using.
Yann Pouillon
Simune Atomistics
Donostia-San Sebastián, Spain
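
As a starting point for that MPI-side investigation, the log above shows two separate symptoms: an ssh host-key prompt for node01, and "bash: orted: command not found", which suggests that the non-interactive shell on the remote node does not have the Open MPI bin directory in its PATH. The sketch below only prints commands to try by hand; it does not contact any node. It assumes that Open MPI is installed under /usr/local/openmpi on every node (the with_mpi_prefix value from the configure file) and that node_3 is the machinefile. list_hosts and print_checks are hypothetical helper names. Note also that the failing command was run as root, whose known_hosts file may differ from that of the user for whom passwordless ssh was tested.

```shell
#!/bin/sh
MPI_PREFIX=/usr/local/openmpi   # assumption: same install path on all nodes
MACHINEFILE=node_3              # machinefile name taken from the original post

# list_hosts (hypothetical helper): print the host names in a machinefile,
# skipping comment lines and blank lines.
list_hosts() {
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$@"
}

# print_checks (hypothetical helper): print, without running them,
# the commands one could try for each host and for the final launch.
print_checks() {
    for host in $(list_hosts "$1"); do
        # Record the host key once so ssh stops asking for confirmation.
        echo "ssh-keyscan -H $host >> ~/.ssh/known_hosts"
        # Check whether a non-interactive remote shell can find orted.
        echo "ssh $host 'which orted'"
    done
    # --prefix tells Open MPI's mpirun where the installation lives on the remote nodes.
    echo "mpirun --prefix $MPI_PREFIX -n 12 -machinefile $1 abinit < BaTiO3.files > BaTiO3.log"
}

if [ -f "$MACHINEFILE" ]; then
    print_checks "$MACHINEFILE"
fi
```

If "which orted" prints nothing on a node, adding the Open MPI bin directory to the PATH set for non-interactive shells on that node, or launching with --prefix as in the last printed line, are the usual things to try. Whether either is sufficient on this diskless setup is not confirmed in this thread.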

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Re: Parallel run error

Post by anemonekgo » Wed Dec 23, 2015 10:03 am

Dear Pouillon

Thank you very much for your answer. I'll try that.
Please let me know if there is anything I can do for you in the future.

Best regards,
Haruyuki Satou

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Re: Parallel run error

Post by anemonekgo » Wed Dec 23, 2015 9:15 pm

Dear Pouillon

Sorry for taking your time again.

If you know of an easy-to-understand example or other information on running abinit in parallel, could you please point me to it?
(Previous forum topics, manuals, etc. Unfortunately, I could not understand it from the tutorial alone.)

As a beginner, I also have trouble finding information.

I'd appreciate your help.

Sincerely,

anemonekgo
(Haruyuki Satou)
