Calculation stops without error when switching on D-field

juhl · Post by **juhl** » Fri Sep 15, 2017 6:42 pm

Dear Abinit community,

I am trying to include a D-field (displacement field) in a calculation on CeO2. Calculations without a field ran fine with this setup, and I followed the tutorial for applying an E-field. However, in the first run that includes a D-field, the calculation always stops during the third SCF cycle. I tried a variety of things, but the problem persists even when the field is set to 0. Since I am fairly new to Abinit and there is no real error message, I was hoping for some help in case something is set up incorrectly. Input and output files are attached.

mverstra · Post by **mverstra** » Sun Sep 17, 2017 4:13 pm

Hi Juhl,

You are mixing a lot of things: parallel execution, PAW, and DFT+U in a constant-D calculation. Could you try (with small ecut and nkpt) to see whether things crash in sequential, without U, and with norm-conserving pseudopotentials, or with combinations of the three?
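The combinations suggested here can be enumerated systematically. Below is a minimal sketch, assuming three independent on/off switches; the labels and the function name `build_run_matrix` are hypothetical and only serve to list the runs, they are not Abinit input variables.

```python
from itertools import product

# Hypothetical labels for the three switches mentioned in the post.
# They are bookkeeping names only, not Abinit input variables.
OPTIONS = {
    "execution": ("sequential", "parallel"),
    "hubbard_u": ("no U", "with U"),
    "pseudopotential": ("norm-conserving", "PAW"),
}

def build_run_matrix(options=OPTIONS):
    """Return every combination of the switches as a list of dicts."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]

if __name__ == "__main__":
    for number, combo in enumerate(build_run_matrix(), start=1):
        print(number, combo)
```

Running the simplest combination first (sequential, no U, norm-conserving) and enabling one switch at a time makes it easier to see which feature first triggers the crash.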

If a small calculation with all options on crashes, then we have something. It could be simply that something is not deallocated properly with this combination, and after a few iterations you overflow the memory. The "official" memory footprint is very small.

cheers

Matthieu

juhl · Post by **juhl** » Wed Sep 20, 2017 12:39 am

Dear Matthieu,

agreed. Trying to locate the error, I ran the same system with:

A) no U
B) no U and no PAWs
C) no U no PAWs but nsppol 2

All these give the same result (crash during iteration #3 in DATASET 21).

Since you suggested narrowing down the error further, I assume you did not spot any error or inconsistency in the input. Apart from a bug, are there physical reasons why a D-field calculation could go wrong?

I am running
D) no U and no PAWs sequential
right now and will update asap.

mverstra · Post by **mverstra** » Wed Sep 20, 2017 8:59 am

The D code has not been used extensively, so there may well be genuine bugs in its compatibility with certain features such as PAW or non-collinear spin. In principle, if your input is wrong, the code should complain rather than proceed and die.

How fast can you make the calculation, by reducing ecut and nkpt, while still reproducing the crash?

Keep us posted.

juhl · Post by **juhl** » Wed Sep 20, 2017 4:09 pm

UPDATE:

The sequential run made it into DATASET 31, which suggests that the problem lies in the MPI parallelisation. That is quite unfortunate for my intended purpose. To confirm this, could you provide an example where a D-field calculation previously ran successfully in parallel? Since the crash occurs during DATASET 21, i.e., the first run with an active D-field, it suggests that the problem is within the D-field routine.

To answer your question: this example currently runs for about 2 minutes before it crashes.

UPDATE2:

As an additional test, I took the test tffield_6 from the abinit/tests/tutorespfn directory. I first ran the test as is, i.e., with efield. Both the serial and the parallel versions of the code run fine. After switching from efield to dfield, I get the same failure as in my example: the calculation stops during iteration step #3 in the first dataset where the dfield is applied.

mverstra · Post by **mverstra** » Wed Sep 20, 2017 10:50 pm

Hi again,

I have tried your input with ecut 6 and ngkpt 1 1 1 and 2 2 2, and it does crash (even in sequential), perhaps for the same reason as you: what I get is an internal consistency check which fails, and the job stops.

scfcv (electric field calculation) : WARNING -
The difference between pel (electronic Berry phase updated
at each SCF cycle)
and pel_cg (electronic Berryphase computed using the berryphase routine) is
pdif_mod = 0.675410089E-06

--- !WARNING
src_file: elpolariz.F90
src_line: 261
message: |

pel_cg(1) = 0.172603609E-09
pel_cg(2) = -0.134600664E-08
pel_cg(3) = -0.934581435E-05
pel(1) = -0.933478809E-06
pel(2) = -0.800590687E-06
pel(3) = 0.812486637E-03
---- KILL ---

If you look at the corresponding source file elpolariz.F90, the output is followed by the command MSG_ERROR, which kills the job, instead of MSG_WARNING, which would allow it to continue. If I replace ERROR with WARNING in that line and recompile, then use 2 2 2 k-points, I run through to dtset 31 in sequential, and things stall there after iteration 3. It could be that this numerical accuracy check (two ways of calculating the polarization, which do not agree) is broken by PAW, or by U, or that the discrepancy simply needs to be converged away with nkpt...

And it could be that this is a borderline case due to low nkpt or ecut, and is not related to your problems...

juhl · Post by **juhl** » Wed Sep 20, 2017 11:25 pm

Hi,

Based on the testing so far, why would it be related to +U or PAW? Isn't it much more likely that the serial version runs fine and the problem is in the parallel part? (To test this, I am now running with PAW and +U in serial, and it is already past the point where the parallel calculation breaks.)

But what you got is interesting. When I use the E-field instead of the D-field, I get the same error that you got (elpolariz.F90).

Running with the setup you described (ecut 6, ngkpt 1 1 1), I get the old crash (DATASET 21, iteration #3) without the error you got, which might indicate that this error is written differently in our versions (as can be seen in my .out file, I am using 8.4.3). Then it could indeed all be related to the elpolariz error. I'll try increasing the k-grid.

UPDATE: Hmm, looking at the code, this test should be there for berryopt 4 and 6, so it would be strange if it were the problem without anything being printed. I think this possibility can be disregarded, which brings us back to the fact that the problem appears only in the parallel part. Furthermore, I already tried doubling the k-grid to 8 8 8, and the behavior was the same (crash, no further error).

mverstra · Post by **mverstra** » Wed Sep 20, 2017 11:34 pm

The message not being printed may simply be due to a caching problem on the nodes you use: if the output to the log file is not flushed properly as the code dies, you lose the last chunk of text, including the error.
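The effect of unflushed output can be reproduced outside Abinit. The following is a minimal sketch in Python, not Abinit code: a child process writes one line to a log file and then dies abruptly via `os._exit`, which stands in for a job that is killed mid-run. The file name `run.log` and the helper `run_child` are made up for the illustration.

```python
import os
import subprocess
import sys
import tempfile

# Child script: write one message, optionally flush, then die abruptly.
# os._exit() skips the normal interpreter cleanup that would flush the
# buffer, standing in for a process that is killed mid-run.
CHILD = """
import os, sys
f = open(sys.argv[1], "w")
f.write("last message before the crash\\n")
if sys.argv[2] == "flush":
    f.flush()
os._exit(1)
"""

def run_child(flush):
    """Run the child and return what actually reached the log file."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "run.log")
        mode = "flush" if flush else "noflush"
        subprocess.run([sys.executable, "-c", CHILD, log, mode])
        with open(log) as handle:
            return handle.read()

if __name__ == "__main__":
    print("without flush:", repr(run_child(flush=False)))
    print("with flush:   ", repr(run_child(flush=True)))
```

Without the flush the message stays in the process buffer and never reaches the file, so a missing error message in a log does not prove that no error was raised.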

First thing: turn the MSG_ERROR into a MSG_WARNING and try in sequential and in parallel.

juhl · Post by **juhl** » Thu Sep 21, 2017 7:43 pm

No, after recompiling with MSG_WARNING instead of MSG_ERROR (l.258 in elpolariz.F90), I get the same behavior, i.e., the parallel run stops in DATASET 21, iteration #3.

juhl · Post by **juhl** » Mon Sep 25, 2017 8:37 pm

Any further ideas? It seems the issue should not be too hard to fix once the error is more or less localized, since the sequential code works fine. (I guess the numerical problems you ran into could have been caused by the reduced cutoff. I am trying to reproduce this behavior, but I think the important message is still that the parallel run fails while the sequential one does not, with otherwise unchanged parameters.)

EDIT:
You mentioned that the D-field code has not been tested extensively yet. Could it be that it simply never ran in the MPI version and we are looking at a bug?

Comparing the output from the parallel run (which fails) and the sequential run (i.e., the last and first print statements before and after the point where the code breaks), the bug should be located in 79_seqpar_mpi/vtorho.F90 between lines 489 and 861.
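This kind of log comparison can be sketched in a few lines of Python. The function name `first_divergence` and the sample lines are made up for the illustration and are not real Abinit output. Real logs also differ in timings and MPI headers, so some filtering would be needed before comparing them.

```python
def first_divergence(lines_a, lines_b):
    """Return the index of the first line where the two logs differ.

    If one log is a prefix of the other, the index is the length of the
    shorter log, i.e. the point where that run stopped printing.
    """
    for index, (a, b) in enumerate(zip(lines_a, lines_b)):
        if a != b:
            return index
    return min(len(lines_a), len(lines_b))

if __name__ == "__main__":
    # Made-up log lines for illustration only, not real Abinit output.
    sequential = ["iter 1", "iter 2", "iter 3", "iter 4"]
    parallel = ["iter 1", "iter 2", "iter 3"]
    stop = first_divergence(sequential, parallel)
    print("last common line:", sequential[stop - 1])
    print("first line missing from the parallel log:", sequential[stop])
```

The print statements that bracket the divergence point then bound the region of source code in which the failing run stopped.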

mverstra · Post by **mverstra** » Mon Sep 25, 2017 10:30 pm

Hi Again

No precise suggestions short of running with one thread under gdb to catch the real error. I will not have time to look into this for some days. If you want to try:
1) set the code running in parallel in the background.
2) use ps aux | grep abinit to find the process ID (pid) of the main executing instance (usually the lowest pid of the abinits)
3) run gdb /PATH/abinit <pid>
And it should latch on to the running executable and pause it.
4) type "cont" (no quotes) to get gdb to continue
5) when it crashes it should give you a stack of subroutines that had been called. In the best case if it can find the sources of Abinit, gdb will give you a line number in the file.
6) type "up" repeatedly to visit the calling subroutines and see where everything is being called from.
7) send us all of this!

Perhaps you already know all of this, but it will be helpful to someone someday.
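Steps 2 and 3 above can be sketched in Python. This is a minimal sketch under stated assumptions: it takes the text printed by `ps aux`, assumes the standard 11-column layout with the PID in the second column, and follows the rule of thumb from the post that the main instance usually has the lowest PID. The helper names and the sample `ps` text are hypothetical, and `/PATH/abinit` is the placeholder from the post.

```python
def lowest_abinit_pid(ps_output, name="abinit"):
    """Return the lowest PID among `ps aux` lines whose command mentions `name`."""
    pids = []
    for line in ps_output.splitlines()[1:]:       # skip the header row
        fields = line.split(None, 10)             # 11th field is the full command
        if len(fields) < 11:
            continue
        command = fields[10]
        if name in command and "grep" not in command:
            pids.append(int(fields[1]))           # 2nd column of `ps aux` is the PID
    return min(pids) if pids else None

def gdb_attach_command(executable, pid):
    """Build the argument list for `gdb <executable> <pid>`."""
    return ["gdb", executable, str(pid)]

# Made-up `ps aux` text for illustration only.
SAMPLE_PS = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user 4312 99.0 1.2 100 200 pts/0 R 10:00 0:05 /PATH/abinit
user 4310 99.0 1.2 100 200 pts/0 R 10:00 0:05 /PATH/abinit
user 4400 0.0 0.0 10 20 pts/0 S+ 10:01 0:00 grep abinit
"""

if __name__ == "__main__":
    pid = lowest_abinit_pid(SAMPLE_PS)
    print(" ".join(gdb_attach_command("/PATH/abinit", pid)))
```

The sketch only builds the command. Attaching, typing "cont", and walking the stack with "up" remain interactive steps inside gdb.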

ABINIT Discussion Forums
