[solved] abnormal(?) memory utilization while writing WFK

Levesque
Posts: 2
Joined: Wed Mar 24, 2010 8:56 pm

[solved] abnormal(?) memory utilization while writing WFK

Post by Levesque » Wed Mar 24, 2010 9:27 pm

Hi everyone, thanks in advance for your advice.

My KSS calculation terminated while writing the WFK file. This is the tail of the .log file:

Code:

 ----iterations are completed or convergence reached----

 outwf  : write wavefunction to file pSSxo_DS1_WFK
-P-0000  leave_test : synchronization done...
-----


The system admin told me that memory utilization increases dramatically on one processor while the file is being written, while the other processors of the node remain idle (this is expected, since I use one processor per node). This is the error message from the system:

Code:

[[32890,1],0][btl_openib_component.c:2948:handle_wc] from compute-0-6.local to: compute-0-7 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 374141704 opcode 0  vendor error 105 qp_idx 3
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27682 on
node compute-0-6.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


I am trying to understand what happened and how to fix it (since I think this causes problems in my SCR calculation, which has an "open g-shell"). I think anybody who knows how ABINIT writes the file in question could help.

Thanks,

SL

Calculation details:
- run on Intel Xeon E5462 quad-core processors (3.0 GHz) with 16 GB of memory per node, using 6 nodes and one process per node to maximise the available memory,
- with abinit-5.9.1 using Troullier-Martins pseudopotentials,
- Here is the .in file (without xcoord, znucl, etc.):

Code:

ndtset 1
acell  15.6  16  8.08613283 angstrom
dilatmx    1.20000000E+00
ecut    20 Hartree
kptopt         1
nband        1300
nstep       500
ngkpt 1 1 10
shiftk 0. 0. 0.
nshiftk 1
getden 1
kssform 3
nbandkss 1300
symmorphi 0
istwfk *1
iscf -2
tolwfr  1.0d-5
zcut 0.0037

Last edited by Levesque on Wed Apr 07, 2010 7:11 pm, edited 1 time in total.

Mamikon Gulian
Posts: 20
Joined: Thu Dec 10, 2009 5:58 pm

Re: abnormal(?) memory utilization while writing the WFK file

Post by Mamikon Gulian » Wed Mar 31, 2010 11:43 pm

Hello,

I don't know much about it, but maybe using MPI I/O for wavefunction files will help decrease the dramatic memory utilization by one processor. See the documentation for the variable accesswff:

http://www.abinit.org/documentation/hel ... #accesswff

-Mamikon

Levesque
Posts: 2
Joined: Wed Mar 24, 2010 8:56 pm

Re: abnormal(?) memory utilization while writing the WFK file

Post by Levesque » Wed Apr 07, 2010 7:11 pm

Mamikon Gulian wrote: maybe using MPI I/O for wavefunction files will help decrease the dramatic memory utilization by one processor...


Thanks, I will look at that.

Here are more details: according to the STATUS files, one of the last routines called is vtowfk, which calls lobpcgccwf. The latter crashed when calling zgemm, a matrix-multiplication routine from the BLAS library. It seems that it keeps 4 to 6 copies of the object, which would explain the problem.

Anyway, the problem is solved when mpw is reduced...
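
For a rough sense of scale, here is a back-of-the-envelope sketch in Python (my own helper functions, not anything ABINIT provides). It estimates the number of plane waves from the cutoff sphere and the memory taken by several copies of one k-point's wavefunction block. The assumptions are loud ones: an orthorhombic cell built directly from acell, dilatmx enlarging the effective cutoff by dilatmx**2, nspinor = 1, 16 bytes per complex coefficient, and "the object" being the full wavefunction block for one k-point.

```python
import math

BOHR_PER_ANGSTROM = 1.8897261246  # length conversion, angstrom -> bohr


def estimate_npw(volume_bohr3, ecut_ha, dilatmx=1.0):
    """Rough count of plane waves inside the kinetic-energy cutoff sphere."""
    # |k+G|^2 / 2 <= ecut (Hartree), so the sphere radius is sqrt(2 * ecut).
    # Assumption: dilatmx raises the effective cutoff to ecut * dilatmx**2.
    k_max = math.sqrt(2.0 * ecut_ha * dilatmx ** 2)
    # Sphere volume (4/3 pi k^3) divided by the volume per G vector ((2 pi)^3 / V).
    return volume_bohr3 * k_max ** 3 / (6.0 * math.pi ** 2)


def wf_gib(npw, nband, nspinor=1, copies=1):
    """Memory in GiB for `copies` blocks of complex double-precision coefficients."""
    bytes_per_coeff = 16  # two 8-byte reals per complex number
    return copies * npw * nband * nspinor * bytes_per_coeff / 2 ** 30


# Values taken from the .in file above; the cell is assumed orthorhombic.
a, b, c = 15.6, 16.0, 8.08613283  # acell in angstrom
volume = a * b * c * BOHR_PER_ANGSTROM ** 3
mpw = estimate_npw(volume, ecut_ha=20.0, dilatmx=1.2)

for copies in (1, 4, 6):
    gib = wf_gib(mpw, nband=1300, copies=copies)
    print(f"{copies} copies of the wavefunction block: {gib:.1f} GiB")
```

With these inputs the estimate comes out at roughly 2 GiB per copy, so 4 to 6 copies would take on the order of 8 to 12 GiB of a 16 GB node. That is only an order-of-magnitude check, but it is at least consistent with the crash and with the fact that reducing mpw makes the problem go away.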

Cheers,

SL

bruneval
Posts: 40
Joined: Mon Aug 17, 2009 11:38 am

Re: [solved] anormal(?) memory utilization while writing WFK

Post by bruneval » Fri Apr 09, 2010 11:00 am

Writing the KSS file is not yet compatible with the paral_kgb=1 parallelization.
The usual strategy is to run a parallel job to output a WFK file and then run a 'fake' sequential job that reads the WFK file and writes the KSS file.

Fabien
