New questions; parallel_run error of abinit

option, parallelism,...

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

New questions; parallel_run error of abinit

Post by anemonekgo » Wed Dec 30, 2015 7:24 am

Thank you, as always, for your help.

I am building a parallel computer (a diskless cluster) by myself; the details are at the bottom of this post. It is still not working properly.

I asked a question about it in this forum on Mon Dec 21, 2015 and received the following advice:

>When I run abinit with the following command, an error occurs as soon as two or more nodes are used.

==>>>Please consult the corresponding MPI documentation and/or contact the developers of the MPI implementation you're using.

After some investigation and with help from an MPI forum, I think I have now solved the MPI problem (see the results below).

The mpirun (openmpi-1.6.5) problem was settled by changing the directories exported by the NFS server from {/home + /usr/local} to {/home + /usr}.

However, abinit still fails under mpirun with the following new type of error.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
$ mpirun -n 8 -machinefile /home/$NAME/my_hosts abinit < BaTiO3.files > BaTiO3.log
=>error stop

abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I have tried various things, but I cannot solve it.

---->For example, I tried export LD_LIBRARY_PATH and so on. Please see below.

When setting up abinit on a cluster, must abinit and its associated programs be placed in the same place (directory)?

Does cluster computing with Abinit need some special technique?

Please help me again.

Best regards,
anemonekgo
(Haruyuki Satou)

The following report is long; I apologize for its length.
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Abinit cluster setting

#==>>After MPI setting fix
=============================================================
Open MPI settings
=============================================================
#edit
/etc/openmpi/openmpi-default-hostfile
node00 slots=4
node01 slots=4


/home/$NAME/my_hosts
node00 slots=4
node01 slots=4
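
The host file above can also drive simple per-node checks. The sketch below is only an illustration: `hosts_from_file` is a hypothetical helper name, and it assumes the `name slots=N` layout shown above (one host per line, optional `#` comment lines).

```shell
#!/bin/sh
# Print the host names listed in an Open MPI style host file:
# the first field of every non-empty line that does not start with '#'.
hosts_from_file() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Example use (not executed here): run a trivial command on every node.
# for h in $(hosts_from_file /home/$NAME/my_hosts); do
#     ssh "$h" hostname
# done
```

Looping over the host file keeps the list of nodes in one place, so the same file feeds both mpirun and the checks.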

--------------------------------------------------------------
Open MPI test results ====>>>>>> review
--------------------------------------------------------------
[Openmpi_test1]
#hello.c #->ref.attachments
$ mpicc hello.c
$ mpirun -np 4 ./a.out
$ mpirun -np 8 --host 192.168.0.100,192.168.0.101 ./a.out
$ mpirun -np 8 --host node00,node01 ./a.out
$ mpirun -np 8 --hostfile /home/$NAME/my_hosts ./a.out
$ mpirun -np 8 -machinefile /home/$NAME/my_hosts ./a.out
$ mpirun -n 8 -machinefile my_hosts ./a.out

==> all commands ended normally (1 node to 15 nodes with mpirun)


[Openmpi_test2]
#test.cpp #->ref.attachments
$ mpic++ -W -Wall test.cpp -o test.o
$ mpirun -np 4 test.o
# ... same commands as in test1 above ...
$ mpirun -n 8 -machinefile my_hosts test.o

==> all commands ended normally (1 node to 15 nodes with mpirun)


I got the same result when I ran the mpirun tests above on all nodes of the cluster (all at the same time; max. = 15 nodes).

I take these results to mean that MPI runs correctly on all nodes of the cluster. Is that right?


|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
=============================================================
abinit cluster ~mpirun tests
>>> For simplicity, I report only the 2-node case.
=============================================================
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

#BaTiO3.in #->ref.attachments
[mpirun-only one node_test]
$ cd /home/$NAME/abinit_Parallel/20151230_2/1_BaTiO3_mpi_8_node
$ mpirun -np 4 --host node00 abinit < BaTiO3.files > BaTiO3.log
==> normal end (3min3sec)

[mpirun--cluster~ 2 nodes_test]
$ mpirun -np 8 --host node00,node01 abinit < BaTiO3.files > BaTiO3.log
=>error stop
------------------------------------------------------------
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory

$ mpirun -np 8 --host 192.168.0.100,192.168.0.101 abinit < BaTiO3.files > BaTiO3.log
=>error stop
------------------------------------------------------------
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory

$ mpirun -n 8 -machinefile /home/$NAME/my_hosts abinit < BaTiO3.files > BaTiO3.log
=>error stop
------------------------------------------------------------
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Then I tried the "export" approach described in the following forum comment.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
># abinit forum Re: job submission on HPC [SOLVED]
> Postby pouillon » Thu May 02, 2013 10:26 am
>
>You need to configure LD_LIBRARY_PATH to point to the libraries you built Abinit with, e.g.:
>
> $ export LD_LIBRARY_PATH="/path/to/lapack/lib:$LD_LIBRARY_PATH"
>
>You'll have to do it for each library your cluster complains about.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# And I ran abinit in parallel again.


$ mpirun -n 8 -machinefile /home/$NAME/my_hosts abinit < BaTiO3.files > BaTiO3.log
=>error stop
------------------------------------------------------------
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory

======>>> The same errors occur.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# Furthermore, I searched for the location of liblapack.so.3 with the "dpkg -L $Package" command.

$ dpkg -L libatlas-base-dev
$ dpkg -L liblapack-dev
/usr/lib/liblapack.so.3
/usr/lib/lapack/liblapack.so.3
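
Knowing where the package installs liblapack.so.3 on one machine does not show that the file is reachable on every node. On a diskless cluster, a path under /usr/lib may be a symbolic link whose target lives in a directory that is not shared; this is a possibility to verify, not a diagnosis. A minimal sketch of such a check, where `check_lib` is a hypothetical helper meant to be run on each node:

```shell
#!/bin/sh
# Report whether a library path resolves to an existing file
# once all symbolic links have been followed.
check_lib() {
    lib=$1
    # readlink -f prints the final target of a chain of symlinks.
    target=$(readlink -f "$lib")
    if [ -n "$target" ] && [ -e "$target" ]; then
        echo "OK $lib -> $target"
    else
        echo "MISSING $lib"
    fi
}

# Example use on one node (not executed here), with the paths reported by dpkg:
# check_lib /usr/lib/liblapack.so.3
# check_lib /usr/lib/lapack/liblapack.so.3
```

If a path is reported as MISSING on a compute node but OK on the master, the two nodes do not see the same files, and no value of LD_LIBRARY_PATH will help until that is fixed.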

# I tried
$ export PATH=/usr/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
$ export LD_LIBRARY_PATH=/usr/lib/lapack:$LD_LIBRARY_PATH
$ export PATH=/usr/local/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
$ export MANPATH=/usr/local/share/man:$MANPATH
$ export MANPATH=/usr/share/man:$MANPATH

# I checked
$ echo $LD_LIBRARY_PATH #==>> the path to liblapack.so.3 is included: OK
$ echo $PATH #OK
$ echo $MANPATH #OK
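
Exports typed in a shell only change the environment of that shell on that node. Open MPI's mpirun has a `-x VAR` option that exports a variable to the processes it launches, including those on remote nodes. A minimal sketch, assuming the library really exists at the same path on every node; `run_abinit_forwarded` and the `MPIRUN` override are hypothetical names used for illustration only:

```shell
#!/bin/sh
# Launcher to use; can be overridden (for example with "echo") for a dry run.
MPIRUN=${MPIRUN:-mpirun}

# Run abinit under mpirun while forwarding LD_LIBRARY_PATH and PATH
# to the remote processes with Open MPI's "-x" option.
# Arguments: host file, number of processes, .files input, log file.
run_abinit_forwarded() {
    hostfile=$1
    nprocs=$2
    files=$3
    log=$4
    "$MPIRUN" -x LD_LIBRARY_PATH -x PATH \
        -n "$nprocs" -machinefile "$hostfile" \
        abinit < "$files" > "$log"
}

# Example use (not executed here):
# run_abinit_forwarded /home/$NAME/my_hosts 8 BaTiO3.files BaTiO3.log
```

Forwarding the variable only helps if the directories it names contain the library on the remote node; it does not copy any files.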


# And I ran abinit in parallel once more.

$ mpirun -n 8 -machinefile /home/$NAME/my_hosts abinit < BaTiO3.files > BaTiO3.log
=>error stop
------------------------------------------------------------
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
abinit: error while loading shared libraries: liblapack.so.3: cannot open shared object file: No such file or directory
======>>> The same errors occur.

20151230_reports end
=======================================================================================================================
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






20151221_reports below
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
=================================================================================================================================================
=================================================================================================================================================
Parallel run error
Postby anemonekgo » Mon Dec 21, 2015 6:09 am

Thank you, as always, for your help.
I am now building a parallel computer (a diskless cluster) by myself; the details are at the bottom of this post.
It is not going well. Please help me again.

When I run abinit with the following commands, an error occurs as soon as two or more nodes are used.

$ mpirun -n 4 abinit < BaTiO3.files > BaTiO3.log # single-node case
=> no problem; the calculation completes (in about 3 min).

$ mpirun -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log # case with two or more nodes
=> the following error message appears

#node_3
node00 cpu=4
node01 cpu=4
node02 cpu=4

=error message===================================================================================================
root@anemonekgo:/home/$user/BaTiO3_n12_mf# mpirun -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log
The authenticity of host 'node01 (192.168.0.101)' can't be established.
ECDSA key fingerprint is a6:bb:67:d8:5a:55:b0:47:03:57:bd:69:c7:6a:7f:04.
Are you sure you want to continue connecting (yes/no)? bash: orted: command not found #The prompt goes by in a moment, so I cannot type yes.
--------------------------------------------------------------------------
A daemon (pid 24085) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
===========================================================================================

a) By the way, I can log in from node to node by ssh without a password.

b) I exported the following path, but the same error occurred.

$ export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
$ echo $LD_LIBRARY_PATH
/usr/local/openmpi/lib:/lib:
$ mpirun -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log
=>same error
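
The message `bash: orted: command not found` indicates that the non-interactive shell started over ssh on the remote node did not find Open MPI's `orted` daemon in its PATH. Open MPI's mpirun accepts `--prefix <dir>`, which sets PATH and LD_LIBRARY_PATH on the remote nodes from that installation directory. The sketch below assumes Open MPI is installed under /usr/local/openmpi on every node (the value of with_mpi_prefix in the configure file further down); `run_with_prefix` and the `MPIRUN` override are hypothetical names.

```shell
#!/bin/sh
# Launcher to use; can be overridden (for example with "echo") for a dry run.
MPIRUN=${MPIRUN:-mpirun}

# Call mpirun with --prefix so that the remote nodes look for orted and the
# Open MPI libraries under the given installation directory.
# First argument: Open MPI prefix; remaining arguments: passed to mpirun.
run_with_prefix() {
    prefix=$1
    shift
    "$MPIRUN" --prefix "$prefix" "$@"
}

# Quick manual check of what a non-interactive ssh shell sees (not executed here):
# ssh node01 'command -v orted; echo $PATH'
#
# Example use (not executed here):
# run_with_prefix /usr/local/openmpi -n 12 -machinefile node_3 abinit < BaTiO3.files > BaTiO3.log
```

The host-key question in the log is separate: connecting once by hand to each node and answering yes stores the key, so later mpirun launches are not interrupted by that prompt.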

I would appreciate it if you would give me some advice.

Best regards,

anemonekgo
(Haruyuki Satou)




============================================================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~
additional information
~~~~~~~~~~~~~~~~~~~~~~~~~~~

=Diskless(PXEboot) cluster system===========================================================
abinit-7.10.4/OS:ubuntu14.04 amd64
-master node :[core i7 4790k(4core)/RAM 16Gb/ASUS H97Plus]x15
-clients node :[core i7 4790k(4core)/RAM 16Gb/ssd 512Gb/ASUS H97Plus/intel NIC]x1

server settings: tftpd-hpa syslinux nfs-kernel-server isc-dhcp-server openssh-server NIS-server

=============================================================================================



***Compiling abinit-7.10.4 **********************************************************
I compiled it following this recipe from the abinit forum.

~Recipe to compile abinit 7.8.2 on UBUNTU 14.04 (64bits) ~
viewtopic.php?f=2&t=2807

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#configure information
$ gedit ubuntu.ac
-----------------------------------------------------------------------------
prefix="/usr/local"
enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_prefix="/usr/local/openmpi"
with_trio_flavor="netcdf+etsf_io"
#with_netcdf_incs="-I/usr/include"
#with_netcdf_libs="-L/usr/lib -lnetcdf -lnetcdff"
with_fft_flavor="fftw3"
with_fft_incs="-I/usr/local/fftw/3.3.4/include"
with_fft_libs="-L/usr/local/fftw/3.3.4/lib -lfftw3 -lfftw3f"
with_linalg_flavor="atlas"
with_linalg_libs="-L/usr/lib -llapack -lf77blas -lcblas -latlas"
with_dft_flavor="atompaw+bigdft+libxc+wannier90"
#with_dft_flavor="atompaw+libxc"
enable_gw_dpc="yes"
#enable_openmp="yes"
-----------------------------------------------------------------------------
$ ./configure --with-config-file="./ubuntu.ac"

==============================================================================
=== Final remarks ===
==============================================================================
Summary of important options:

* C compiler : gnu version 4.8
* Fortran compiler: gnu version 4.8
* architecture : unknown unknown (64 bits)

* debugging : basic
* optimizations : standard

* OpenMP enabled : no (collapse: ignored)
* MPI enabled : yes
* MPI-IO enabled : yes
* GPU enabled : no (flavor: none)

* TRIO flavor = netcdf+etsf_io-fallback
* TIMER flavor = abinit (libs: ignored)
* LINALG flavor = atlas (libs: user-defined)
* ALGO flavor = none (libs: ignored)
* FFT flavor = fftw3 (libs: user-defined)
* MATH flavor = none (libs: ignored)
* DFT flavor = libxc-fallback+atompaw-fallback+bigdft-fallback+wannier90-fallback

Configuration complete.
You may now type "make" to build ABINIT.
(or, on a SMP machine, "make mj4", or "make multi multi_nprocs=<n>")
--------------------------------------------------------------------------
$ make mj4
$ make install

$ cd tests
$ ./runtests.py -j 4 fast

Regenerating database...
Saving database to file /home/appwg-3/abinit-7.10.4/tests/test_suite.cpkl
Running ntests = 26, MPI_nprocs = 1, py_nthreads = 4...
[fast][t01][np=1]: succeeded
[fast][t29][np=1]: succeeded
Test suite completed in 7.51 s (average time for test = 1.65 s, stdev = 1.72 s)
failed: 0, succeeded: 10, passed: 1, skipped: 0, disabled: 0

$ ./runtests.py paral -n 2 -j 2
Test_suite directory already exists! Old files will be removed
Running ntests = 98, MPI_nprocs = 2, py_nthreads = 2...
Test suite completed in 77.42 s (average time for test = 1.63 s, stdev = 4.30 s)
failed: 0, succeeded: 14, passed: 6, skipped: 70, disabled: 0

==============================================
@cluster_tests structure:BaTiO3
~ecut=15
~ngkpt=8x8x8

attachment:BaTiO3.in
anemonekgo

=================================================================================================================================================

Posts: 9
Joined: Tue Sep 22, 2015 3:54 am
Re: Parallel run error
Postby pouillon » Mon Dec 21, 2015 2:36 pm

This problem is not related to Abinit, but to your MPI installation. Please consult the corresponding MPI documentation and/or contact the developers of the MPI implementation you're using.
Yann Pouillon
Universidad del País Vasco UPV/EHU
Donostia-San Sebastián, Spain
pouillon

Posts: 604
Joined: Wed Aug 19, 2009 10:08 am
Location: Spain


=================================================================================================================================================
Re: Parallel run error
Postby anemonekgo » Wed Dec 23, 2015 11:03 am

Dear Pouillon

Thank you very much for your answer. I will try that.
Please let me know if there is anything I can do for you in the future.

Best regards,
Haruyuki Satou
anemonekgo

Posts: 9
Joined: Tue Sep 22, 2015 3:54 am
Re: Parallel run error
Postby anemonekgo » Wed Dec 23, 2015 10:15 pm

Dear Pouillon

Sorry to take up your time again.

If you know of an easy-to-understand example or other information on running abinit in parallel, could you please tell me?
(Previous forum topics, manuals, etc. Unfortunately, I was not able to understand it from the tutorial alone.)

As a beginner, I also have trouble finding information.

I'd appreciate your help.

Sincerely,

anemonekgo
(Haruyuki Satou)
=================================================================================================================================================
Attachments
BaTiO3.in
(1.06 KiB) Downloaded 397 times

pouillon
Posts: 651
Joined: Wed Aug 19, 2009 10:08 am
Location: Spain

Re: New questions; parallel_run error of abinit

Post by pouillon » Wed Dec 30, 2015 10:12 am

Properly setting up a cluster can be very tricky. In your case, you have to make sure that all the libraries you link Abinit with are available on *all* the nodes. A typical mistake is to set LD_LIBRARY_PATH on the front-end node but not on the other ones.
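
One way to see which libraries are missing on which node is to run ldd against the abinit binary on each node and keep only the "not found" entries. A minimal sketch: `missing_libs` and `check_node` are hypothetical helper names, and the binary path /usr/local/bin/abinit is an assumption based on prefix="/usr/local" in the configure file.

```shell
#!/bin/sh
# Read ldd output on stdin and print the names of the libraries
# that the dynamic loader could not find.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Run ldd for a binary on a remote node over ssh and list what is missing there.
# Arguments: node name, absolute path of the binary.
check_node() {
    node=$1
    binary=$2
    ssh "$node" ldd "$binary" | missing_libs
}

# Example use (not executed here), with the node names from the host file:
# for n in node00 node01; do
#     echo "== $n"
#     check_node "$n" /usr/local/bin/abinit
# done
```

A node that prints nothing under its name resolves every library; any name printed there is a library that node cannot load.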

However, this goes way beyond just setting up Abinit, and we can provide only very limited help. If you still encounter problems, you should ask people who build clusters as you do directly. They will give you a lot of useful insights and tips.

Update: I moved this topic to platform-specific questions, which better fits its content.
Yann Pouillon
Simune Atomistics
Donostia-San Sebastián, Spain

anemonekgo
Posts: 21
Joined: Tue Sep 22, 2015 3:54 am

Re: New questions; parallel_run error of abinit

Post by anemonekgo » Wed Dec 30, 2015 11:18 am

Dear Pouillon

Thanks for your quick reply.

I now understand well that setting LD_LIBRARY_PATH is important.

I will go through the setup once again and keep trying.

Unfortunately, I could not find any information on the web about setting up abinit on a diskless cluster.

If you know of any related information or people, I would be very much obliged if you could tell me.

Even a little information would be welcome.

I would like to ask those people.

Sincerely,

anemonekgo
(Haruyuki Satou)
