parallel test error with abinit 7.8.2


Post by weitong » Sat Nov 21, 2015 5:28 am

Running $ runtests.py -j32, I got 2 failed tests and too many skipped tests in the paral suite. (I am using MPICH 3.1 and the Intel 13.1 compilers.)
Here is the summary:

......
[libxc][t03][np=1]: failed: absolute error 0.2 > 0.11
[seq][tsv2_81][np=0]: Skipped: Build environment defines the CPP variable HAVE_MPI
[seq][tsv2_82][np=0]: Skipped: Build environment defines the CPP variable HAVE_MPI
[v67mbpt][t28][np=1]: failed: relative error 0.9148 > 0.5
......
Suite        failed  passed  succeeded  skipped  disabled  run_etime  tot_etime
atompaw           0       1          1        0         0      18.55      18.95
bigdft            0       3         19        0         0     496.67     499.73
built-in          0       0          7        0         0       9.28       9.90
etsf_io           0       0          7        0         0      16.31      16.94
fast              0       0         11        0         0      39.78      41.96
fox               0       1          1        0         0     102.39     102.62
gpu               0       0          0        4         0       0.00       0.00
libxc             1       4         16        0         0     236.67     242.86
mpiio             0       0          1       13         0       3.08       3.22
paral             0       6         15       68         0     478.35     485.57
seq               0       0          0       18         0       0.00       0.02
tutoparal         0       0          1        1         0       0.46       0.48
tutoplugs         0       4          0        0         0      18.86      19.38
tutorespfn        0       8         14        0         0    1258.46    1272.36
tutorial          0       7         41        0         0     883.58     896.89
unitary           0       0         21       13         0      91.92      92.75
v1                0       2         74        0         0     236.39     248.32
v2                0      12         67        0         0     297.54     313.69
v3                0      12         68        0         0     464.03     488.90
v4                0      17         45        0         0     498.61     520.28
v5                0      17         57        0         0    1003.78    1042.13
v6                0      13         47        0         0     787.03     829.73
v67mbpt           1       3         13        0         0     295.81     301.99
v7                0      13         16        0         0     616.51     628.66
vdwxc             0       0          1        0         0      13.41      13.47
wannier90         0       7          0        0         0      41.01      41.88
Test suite results in HTML format are available in Test_suite/suite_report.html
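
To make a summary like the one above easier to triage, the table can be parsed and filtered. The following is only a minimal sketch, assuming the column layout shown in this post (suite name, five integer counters, two timings); the function names `parse_summary` and `suites_needing_attention` are hypothetical and not part of the ABINIT test suite.

```python
def parse_summary(text):
    """Parse a runtests.py summary table into a list of dicts, one per suite.

    Assumes each data row is: suite failed passed succeeded skipped disabled
    run_etime tot_etime (the layout pasted in this post).
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 8:
            continue  # not a table row
        try:
            counts = [int(p) for p in parts[1:6]]
            times = [float(p) for p in parts[6:8]]
        except ValueError:
            continue  # header row or unrelated text
        rows.append({
            "suite": parts[0],
            "failed": counts[0],
            "passed": counts[1],
            "succeeded": counts[2],
            "skipped": counts[3],
            "disabled": counts[4],
            "run_etime": times[0],
            "tot_etime": times[1],
        })
    return rows


def suites_needing_attention(rows):
    """Return suites with any failure, or with more skipped tests than executed ones."""
    return [
        r["suite"]
        for r in rows
        if r["failed"] > 0 or r["skipped"] > r["passed"] + r["succeeded"]
    ]
```

Feeding it the pasted table (saved to a text file, for instance) would list the suites that failed or were mostly skipped, which is where the paral suite stands out.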


Then I ran the parallel tests with
$ runtests.py paral -n 2 -j 8
and got many errors:

......
[paral][t71_MPI2][np=2]: fldiff.pl fatal error:
The diff analysis cannot be done: the number of lines to be analysed differ.
File /home/weitong/Program/abinit-7.8.2/tests/paral/Refs/t71_MPI2.out: 1131 lines, 82 ignored
File /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t71_MPI2/t71_MPI2.out: 1127 lines, 79 ignored
Command mpirun -np 2  /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/src/98_main/abinit < /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t72_MPI2/t72_MPI2.stdin > /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t72_MPI2/t72_MPI2.stdout 2> /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t72_MPI2/t72_MPI2.stderr
 returned exit_code: 124

[paral][t72_MPI2][np=2]: fldiff.pl fatal error:
The diff analysis cannot be done: the number of lines to be analysed differ.
File /home/weitong/Program/abinit-7.8.2/tests/paral/Refs/t72_MPI2.out: 1173 lines, 62 ignored
File /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t72_MPI2/t72_MPI2.out: 1169 lines, 59 ignored
Command mpirun -np 2  /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/src/98_main/abinit < /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t73_MPI2/t73_MPI2.stdin > /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t73_MPI2/t73_MPI2.stdout 2> /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t73_MPI2/t73_MPI2.stderr
 returned exit_code: 124
...........
Test suite completed in 918.08 s (average time for test = 62.00 s, stdev = 225.35 s)
failed: 6, succeeded: 11, passed: 3, skipped: 69, disabled: 0
[paral][t71_MPI2][np=2] has run_etime 900.01 s
[paral][t72_MPI2][np=2] has run_etime 900.01 s
[paral][t73_MPI2][np=2] has run_etime 900.01 s
[paral][t74_MPI2][np=2] has run_etime 900.01 s
[paral][t75_MPI2][np=2] has run_etime 900.01 s
[paral][t76_MPI2][np=2] has run_etime 900.01 s
Suite   failed  passed  succeeded  skipped  disabled  run_etime  tot_etime
paral        6       3         11       69         0    5517.71    5520.87
Test suite results in HTML format are available in Test_suite/suite_report.html
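
In the log above, every failing test reports run_etime 900.01 s and the command returns exit_code 124. Exit status 124 is what GNU `timeout` returns when its time limit expires, so this pattern is consistent with the jobs hanging until they were killed rather than crashing. A minimal sketch for pulling those tests out of a saved log, assuming the line formats shown in this post and a 900 s limit inferred from the timings (both are assumptions; the function names are hypothetical):

```python
import re

# Matches lines such as: [paral][t71_MPI2][np=2] has run_etime 900.01 s
ETIME_RE = re.compile(
    r"^\[([^\]]+)\]\[([^\]]+)\]\[np=(\d+)\] has run_etime ([\d.]+) s"
)


def timed_out_tests(log_text, limit_s=900.0):
    """Return (suite, test, run_etime) for tests that ran up to the time limit.

    limit_s=900.0 is inferred from the log in this post, not from ABINIT docs.
    """
    hits = []
    for line in log_text.splitlines():
        m = ETIME_RE.match(line.strip())
        if m and float(m.group(4)) >= limit_s:
            hits.append((m.group(1), m.group(2), float(m.group(4))))
    return hits


def count_timeout_exits(log_text):
    """Count commands that returned exit code 124 (GNU timeout's 'timed out' status)."""
    return len(re.findall(r"returned exit_code:\s*124\b", log_text))
```

If the two counts agree, the failures are hangs, and the fldiff.pl line-count errors are only a side effect of truncated output files.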
jobrunner.py


Following the post viewtopic.php?f=2&t=2618, I changed mpirun to mpiexec in jobrunner.py:

......
    @classmethod
    #def generic_mpi(cls, ompenv=None, use_mpiexec=False, timebomb=None):
    def generic_mpi(cls, ompenv=None,use_mpiexec=True, timebomb=None):
        # It should work, provided that the shell environment is properly defined.
        d = {}
        d["mpirun_np"] = "mpiexec -np"
#        if use_mpiexec:
#            d["mpirun_np"] = "mpiexec -np"
#        else:
#            d["mpirun_np"] = "mpirun -np"
        #d["mpirun_extra_args"] = ""
        d["ompenv"] = ompenv
        d["timebomb"] = timebomb
        return cls(d)
.....


However, I still get the same errors with $ runtests.py paral -n 2 -j 8:

........
[paral][t94_MPI4][np=0]: Skipped: nprocs: 2 != nprocs_to_test: 4
nprocs: 2 in exclude_nprocs: [1, 2, 3]

[paral][t06_MPI2][np=2]: passed: absolute error 2e-05 < 0.011, relative error 0.0006127 < 0.0008        ------>stuck here for more than 10 minutes
[paral][t76_MPI2][np=2]: fldiff.pl fatal error:
Unable to open input file /home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t76_MPI2/t76_MPI2o_DS3_EXC_MDF
/home/weitong/Program/abinit-7.8.2/tests/pymods/testsuite.py:2925: UserWarning: exception while creating tarball file: [Errno 2] No such file or directory: '/home/weitong/Program/abinit-7.8.2/tmp_mpich_intel13/tests/Test_suite/paral_t76_MPI2/t76_MPI2o_DS3_EXC_MDF'
  warn("exception while creating tarball file: %s" % str(exc))
Test suite completed in 918.34 s (average time for test = 62.01 s, stdev = 225.35 s)
failed: 6, succeeded: 11, passed: 3, skipped: 69, disabled: 0
[paral][t71_MPI2][np=2] has run_etime 900.01 s
[paral][t72_MPI2][np=2] has run_etime 900.01 s
[paral][t73_MPI2][np=2] has run_etime 900.01 s
[paral][t74_MPI2][np=2] has run_etime 900.01 s
[paral][t75_MPI2][np=2] has run_etime 900.01 s
[paral][t76_MPI2][np=2] has run_etime 900.01 s
Suite   failed  passed  succeeded  skipped  disabled  run_etime  tot_etime
paral        6       3         11       69         0    5519.16    5522.46
Test suite results in HTML format are available in Test_suite/suite_report.html
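
Since switching between mpirun and mpiexec made no difference, one thing that may be worth checking is which launcher binaries the shell actually resolves, and whether they belong to the same MPICH installation that abinit was built against. A launcher from a different MPI installation is a common cause of parallel jobs that hang, although nothing in the logs above confirms that this is the cause here. A minimal sketch using only the Python standard library (the function name `find_launchers` is hypothetical):

```python
import os
import shutil


def find_launchers(names=("mpirun", "mpiexec")):
    """Map each launcher name to its resolved absolute path, or None if not on PATH."""
    found = {}
    for name in names:
        path = shutil.which(name)
        # realpath follows symlinks, so both names can be traced to one install prefix
        found[name] = os.path.realpath(path) if path else None
    return found


if __name__ == "__main__":
    for name, path in find_launchers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

The printed paths can then be compared with the MPICH prefix given to configure when building abinit.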
