defective inter-node parallelism with 2017 Intel compilers
Posted: Tue Mar 20, 2018 1:20 pm
Hello!
After trying for a long time to solve this myself, I thought I would ask for help on this forum as well. Thank you in advance for your advice.
I would like to start using a system that only has the compilers from Intel Composer 2017. I have configured the build as suggested by the Intel MKL advisor, and I can get the abinit executables. What I notice is that abinit runs fine on the processors of a single node, but whenever I try to use more than one node, one of three things happens: the geometry relaxation refuses to converge after a few Broyden steps, the job crashes after a few Broyden steps, or the calculation completes but takes longer than on a single node. So intra-node parallelism is OK, but inter-node parallelism is not. The behavior is essentially the same with abinit versions 7.8.1, 8.2.2 and 8.6.3.
If I compile the same versions of abinit with the Intel Composer 2013 compilers (on another platform that accepts both the 2013 and 2017 versions), everything is OK: my test calculations converge and the run time scales reasonably with the number of processors used. On that platform too, the 2017 compilers still do not give me executables that run well on more than one node.
It might also be relevant to mention that I always need some tricks to make ./configure recognize the Intel compilers. In addition to specifying the MPI prefix, I have to manually replace mpif90 -> mpiifort etc. (similarly for C and C++) inside the configure file, and also specify with_fc_vendor="intel", with_fc_version="17.0.5" etc. (similarly for C and C++) in the .ac file. If I don't do this, either the configuration fails or I must manually copy all the .mod files, as they are generated under the src directories, into the src/mods or src/incs directory for the compilation to succeed. I wonder whether some of the irregular behavior is caused by the configure script being unable to recognize the compilers by itself, as it should... though again, with the 2013 compilers everything always runs OK.
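For illustration, the settings described above would look roughly like this in the .ac file. This is only a sketch, not a tested configuration: the MPI path is a placeholder, and the C and C++ option names are assumed by analogy with the Fortran ones.

```sh
# Sketch of the compiler-related lines in the build .ac file.
# Placeholder: replace with the actual Intel MPI installation directory.
with_mpi_prefix="/path/to/intel/mpi"

# Fortran compiler hints, as quoted in the post.
with_fc_vendor="intel"
with_fc_version="17.0.5"

# C and C++ hints: option names assumed by analogy with the Fortran ones.
with_cc_vendor="intel"
with_cc_version="17.0.5"
with_cxx_vendor="intel"
with_cxx_version="17.0.5"
```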
I can provide more details (log files or specific build options used) if someone has an idea about what could be done in order to make inter-node parallelism work with 2017 Intel compilers.
I have seen that someone reported issues when using more than one node back in 2012, though that was with the 2013 Intel compilers, which work fine in my case, and I also don't always get job crashes. So the cause is probably different.
viewtopic.php?f=3&t=1851
Many thanks for reading and for any clues.
Dan