Abinit-601, BUG: chaining calculations, npw != npw1


Post by torrent » Wed Feb 24, 2010 1:04 pm

Max,

Thanks for those tests...

Now it's up to the developers.
Apparently, there is a memory problem when chaining different datasets in the prep_kpgio.F90 routine:
a block of memory is not cleanly freed.

Meanwhile, the workaround for you is to run the datasets as separate jobs.
This can be scripted in sh, Python, or whatever you prefer; a rough sketch in sh follows (untested; every name in it is a placeholder for your own setup)...
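
    #!/bin/sh
    # Run the two datasets as two sequential jobs instead of one chained run.
    # Assumptions (all placeholders): ndtset/jdtset sit on a single line of
    # job.in; each job_<dts>.files exists, with its prefixes arranged so that
    # run 12 can read the WFK written by run 11; "aprun -n 300" stands for
    # your own MPI launcher and process count.
    for dts in 11 12; do
        sed "s/^ndtset.*/ndtset 1  jdtset $dts/" job.in > job_$dts.in
        aprun -n 300 abinit < job_$dts.files > log_$dts 2>&1 || exit 1
    done
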
Sorry for this...

Marc




Dear Marc,

[...]


> 2-This strange message with "ERRRR" means that a memory area (that should have been deallocated) is not free.
> It can happen in several pieces of code; to help us solve your problem, we need more information.
> Could you perform the following tests (separately):
>
> a) launch your calculation in two separate runs; you keep your input but...
> ...for the first run, you put "ndtset 1 jdtset 11".
> ...for the 2nd run, you put "ndtset 1 jdtset 12".
>

The calculations ran fine.



> b) use explicit values for npkpt, npfft, npband and bandpp.
> Reading your previous mail, I guess that you let the code decide the distribution of procs.
> That can be problematic for chained calculations.
>
I've given the following values to these keywords (with paral_kgb=1, this fixes the distribution at 1 x 10 x 30 = 300 MPI processes):

npkpt 1
npfft 10
npband 30
bandpp 4

The separate calculations "ndtset 1 jdtset 11" and "ndtset 1 jdtset 12"
ran to completion.

This is not the case for the chained calculation "ndtset 2 jdtset 11 12",
which fails as before with the message:

<error_msg>
-P-0000 - newkpt: read input wf with ikpt,npw= 1 140, make ikpt,npw= 1 262
ERRRRRRRRRRR 262 42364
-P-0000
-P-0000 leave_new : decision taken to exit ...
-P-0000 leave_new : synchronization done...
-P-0000 leave_new : exiting...
Application 442458 exit codes: 1
</error_msg>


I hope this helps. Please do not hesitate to tell me if you
need any further information.

Best regards,
Max


>
> Marc
>
>
>
> On 17.02.2010 11:45, Latévi Max LAWSON DAKU wrote:
>> Hi Marc,
>>
>> The MPI version used is the Cray-XT MPT-4.0.0, based on MPICH2.
>> Quoting the information provided
>>
>> <quote>
>> The MPICH2 version on which this MPI is based was upgraded from version
>> 1.0.6p1 to 1.1.1p1 and contains the following main features:
>> - MPI 2.1 Standard support (except dynamic process management)
>> - MPI-IO supports the MPI_Type_create_resized and MPI_Type_create_indexed_block datatypes
>> - Many bugfixes from ANL
>>
>> [...]
>> </quote>
>>
>> Here is the error message output by the MPI-IO-enabled version
>> while trying to chain the calculation (I guess this is related to
>> François's mail regarding how the reading of the WFK may be
>> implemented?):
>>
>> <error_msg>
>> [...]
>> -P-0000 - newkpt: read input wf with ikpt,npw= 1 140, make ikpt,npw= 1 262
>> ERRRRRRRRRRR 262 42364
>> -P-0000
>> -P-0000 leave_new : decision taken to exit ...
>> -P-0000 leave_new : synchronization done...
>> -P-0000 leave_new : exiting...
>> </error_msg>
>>
>>
>> Max
>>
>> /opt/cray/mpt/4.0.0/xt/seastar/mpich2-gnu/lib/44
>>
>>
>> On 17. 02. 10 11:11, TORRENT Marc wrote:
>>> Hi Boris,
>>>
>>> Do you use the Titane supercomputer with the BullxMPI version of MPI?
>>> If yes, the problem has been identified... this is an issue due to the BullxMPI library, based on OpenMPI-1.3.x.
>>> The problem seems to be solved by using OpenMPI-1.4.1 (directly, not using the Bull packaging).
>>> The simplest solution for you is to temporarily switch back to BullMPI2...
>>> I'm currently testing a version compiled with OpenMPI-1.4.1, with the help of other users... stay tuned...
>>>
>>> Max: what version of MPI do you use ?
>>>
>>> Marc
>>>
>>>
>>> On 17.02.2010 10:13, DORADO Boris wrote:
>>>>
>>>> Dear all,
>>>>
>>>>
>>>>
>>>> I think I'm encountering the same issue with my calculations. I'm using Abinit 6 with parallel I/O activated, as well as the parallelization over bands, k-points and g-vectors.
>>>>
>>>>
>>>>
>>>> As in Max's case, I couldn't get through the first output either after setting accesswff=1. The calculation seemed to hang after (or while?) writing the wavefunctions. If I do not use accesswff=1, the calculation finishes properly and the wavefunction file is written, but restarting from it does not work.
>>>>
>>>>
>>>>
>>>> Best regards
>>>>
>>>>
>>>>
>>>> Boris
>>>>
>>>> ------------------------------------------------------------------------------------------
>>>> Boris Dorado
>>>> CEA, DEN, DEC, Centre de Cadarache
>>>> Laboratoire des Lois de Comportement du Combustible, Bâtiment 130
>>>> 13108 Saint-Paul-lez-Durance, France
>>>> Tel : +33 - (0)4 42 25 61 93
>>>> Fax : +33 - (0)4 42 25 32 85
>>>> ------------------------------------------------------------------------------------------
>>>>
>>>> From: Latévi Max LAWSON DAKU [mailto:Max.Lawson@unige.ch]
>>>> Sent: Wednesday, 17 February 2010 10:06
>>>> To: forum@abinit.org
>>>> Subject: Re: [abinit-forum] Abinit-601, BUG: chaining calculations, npw != npw1
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 16. 02. 10 10:10, TORRENT Marc wrote:
>>>>
>>>> Dear Max
>>>>
>>>> Dear Marc,
>>>>
>>>> Many thanks for your kind e-mail.
>>>>
>>>>
>>>> I see this statement in your output file:
>>>> " Parallel I/O : no"
>>>>
>>>> If you want to perform WFK reading/writing with paral_kgb=1 enabled, the use of "parallel IO" is mandatory.
>>>> But I don't understand why the code does not complain about it (there should be a message saying: MPI-IO is needed for IO with paral_kgb=1).
>>>> The message was present in previous versions of the code.
>>>>
>>>>
>>>> I've had a look at the output. You are right: the message is still present.
>>>> I previously missed it by focussing on the "BUG" message.
>>>>
>>>> I did recompile with "--enable-mpi-io", and since chaining the
>>>> calculations still proved impossible, I didn't go through the first output.
>>>>
>>>>
>>>> Best regards,
>>>> Max
>>>>
>>>>
>>>> So, you MUST compile with "--enable-mpi-io"; then the keyword accesswff will automatically be set to 1 to activate parallel I/O.
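>>>>
>>>> With the 6.x build system, that amounts to something like the sketch
>>>> below at configure time (the compiler wrappers shown are the Cray XT
>>>> ones and are only an example; adapt them, and any other options, to
>>>> your machine):
>>>>
>>>>    ./configure --enable-mpi --enable-mpi-io FC=ftn CC=cc CXX=CC
>>>>    make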
>>>>
>>>> Marc Torrent
>>>> CEA-Bruyeres-le-Chatel - France
>>>>
>>>>
>>>>
>>>> On 02/14/10 11:19, Latévi Max LAWSON DAKU wrote:
>>>>
>>>> Dear Abinit developers,
>>>>
>>>> I'm giving Abinit-6.0.1 a try. The compilation went fine.
>>>> But, while trying to chain SCF calculations, a bug showed
>>>> up on going from the first to the second dataset. Here is
>>>> the tail of the output with the error message:
>>>>
>>>> <output>
>>>> [..]
>>>> -P-0000 hdr_check: WARNING -
>>>> -P-0000 Restart of self-consistent calculation need translated wavefunctions.
>>>> -P-0000 Indeed, critical differences between current calculation and
>>>> -P-0000 restart file have been detected in:
>>>> -P-0000 * the plane-wave cutoff
>>>> -P-0000 ===============================================================
>>>> -P-0000 leave_test : synchronization done...
>>>> kpgio: loop on k-points done in parallel
>>>> -P-0000
>>>> -P-0000 Subroutine Unknown:0:BUG
>>>> -P-0000 Reading option of rwwf. One should have npw=npw1
>>>> -P-0000 However, npw= 140, and npw1= 144.
>>>> -P-0000
>>>> -P-0000 leave_new : decision taken to exit ...
>>>> </output>
>>>>
>>>>
>>>> The routine involved is rwwf() in src/59_io_mpi/rwwf.F90.
>>>> I've enabled parallelization with paral_kgb=1 and didn't use the
>>>> MPI I/O routines. I've attached the relevant part of my input.
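>>>>
>>>> For reference, a chained restart of this kind looks roughly like the
>>>> following generic sketch (illustrative values only, not the actual
>>>> attached input):
>>>>
>>>>    ndtset 2    jdtset 11 12
>>>>    paral_kgb 1
>>>>    # dataset 11: initial SCF run
>>>>    ecut11   30.
>>>>    # dataset 12: restart from the dataset-11 WFK at a higher cutoff
>>>>    ecut12   40.
>>>>    getwfk12 -1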
>>>>
>>>>
>>>> Best regards,
>>>> Max
>>>>
>>>>
>>>> P.S. Please find below the build information.
>>>>
>>>>
>>>> === Build Information ===
>>>> Version : 6.0.1
>>>> Build target : x86_64_linux_gnu4.4
>>>> Build date : 20100213
>>>>
>>>> === Compiler Suite ===
>>>> C compiler : gnu4.4
>>>> CFLAGS : -g -O2 -march=opteron
>>>> C++ compiler : gnu4.4
>>>> CXXFLAGS : -g -O2 -march=opteron
>>>> Fortran compiler : gnu4.4
>>>> FCFLAGS : -O3 -funroll-loops -ffast-math -march=barcelona
>>>> FC_LDFLAGS :
>>>>
>>>> === Optimizations ===
>>>> Debug level : yes
>>>> Optimization level : standard
>>>> Architecture : amd_opteron
>>>>
>>>> === MPI ===
>>>> Parallel build : yes
>>>> Parallel I/O : no
>>>>
>>>> === Linear algebra ===
>>>> Library type : abinit
>>>> Use ScaLAPACK : no
>>>>
>>>> === Plug-ins ===
>>>> BigDFT : yes
>>>> ETSF I/O : yes
>>>> LibXC : yes
>>>> FoX : no
>>>> NetCDF : yes
>>>> Wannier90 : yes
>>>>
>>>> === Experimental features ===
>>>> Bindings : no
>>>> Error handlers : no
>>>> Exports : no
>>>> GW double-precision : no
>>>> Macroave build : yes
>>>>
>>>>
>>>>
>>>>

--
***********************************************
Latevi Max LAWSON DAKU
Universite de Geneve - Sciences II
30, quai Ernest-Ansermet
CH-1211 Geneve 4
Switzerland

Tel: (41) 22/379 6548 ++ Fax: (41) 22/379 6103
***********************************************
Marc Torrent
CEA - Bruyères-le-Chatel
France
