
[SOLVED] VASP 5 crashes when using several computing nodes (large memory)

Posted: Tue Oct 09, 2012 1:34 pm
by ivasan
Hello Everyone,

I have compiled VASP 5.3.2 without errors and it runs properly when I use only one computing node of our cluster (each node has two hexacore processors). The problem arises when I try to use more than one computing node: the run crashes right at the beginning, at "Iteration 1(1)", just after EDDIAG finishes (when the same simulation runs on a single node, the next step is RMM-DIIS).

The solution given in this post http://cms.mpi.univie.ac.at/vasp-forum/ ... hp?3.11392 did not solve the problem.

I have uploaded to Dropbox several files with all the information I could gather about this issue:

- File 'Makefile' ( https://dl.dropbox.com/u/27436218/Makefile ): the Makefile used to build VASP. In brief: I used Intel MPI 4.0.3, Intel Compilers XE2013, and the BLACS and FFTW3 interfaces from Intel MKL (Intel Toolkit XE2013).

- File 'simul.log' ( https://dl.dropbox.com/u/27436218/simul.log ): the messages printed to the screen while the crashing simulation runs. I passed several options to mpirun to gather information about the MPI calls, since the problem seems to be there ( -v -check_mpi -genv I_MPI_DEBUG 5 ); a sketch of the full invocation is given after this list. The interesting information is at the end of the file.

- File 'INCAR' ( https://dl.dropbox.com/u/27436218/INCAR ): the input file of the simulation I am trying to run, in case it is relevant. In brief, it is an ionic relaxation. This input file works fine when I use only one computing node, so I don't think the problem is there.

- File 'OUTCAR' ( https://dl.dropbox.com/u/27436218/OUTCAR ): the output file of the simulation.
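
For completeness, the mpirun command I use looks roughly like the following (the debug flags are the ones mentioned above; the number of ranks and the output redirection are placeholders for my setup):

Code: Select all

# 2 nodes x 12 cores = 24 MPI ranks (placeholder; adjust to the cluster at hand)
mpirun -v -check_mpi -genv I_MPI_DEBUG 5 -np 24 \
    /home/ivasan/progrmas/VASP/vasp.5.3_test/vasp > simul.log 2>&1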

It seems from simul.log that the errors are related to MPI since there are messages such as:

Code: Select all

[23] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[23] ERROR:    Fatal signal 11 (SIGSEGV) raised.
[23] ERROR:    Signal was encountered at:
[23] ERROR:       hamil_mp_hamiltmu_ (/home/ivasan/progrmas/VASP/vasp.5.3_test/vasp)
[23] ERROR:    After leaving:
[23] ERROR:       mpi_allreduce_(*sendbuf=0x7fff5d1ce340, *recvbuf=0x18e19c0, count=1, datatype=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=0xffffffffc4060000 CART_SUB CART_CREATE CART_SUB CART_CREATE COMM_WORLD [18:23], *ierr=0x7fff5d1ce2ac->MPI_SUCCESS)
I would appreciate it if anyone could give me a hint about what I can check or modify in order to solve this problem.

Thank you very much in advance for your answers and your time.

Kind regards,

Ivan







[SOLVED] VASP 5 crashes when using several computing nodes (large memory)

Posted: Fri Oct 26, 2012 2:11 pm
by ivasan
Dear all,

I have been doing my homework and I have found some solutions.

First of all, I have to say that the same job worked perfectly across several nodes with VASP 4.6.38, but it failed with VASP 5.2.12 and 5.3.2 (all three versions compiled with the same options, libraries and MPI).

Finally, I found a solution that worked for me in a post on this forum (http://cms.mpi.univie.ac.at/vasp-forum/ ... php?2.5776). I will summarize it here; it makes several recommendations:

- Including "ulimit -s unlimited" in the .bashrc or in the .bash_profile files. This didn't work for me.

- Including the option "-heap-arrays" when compiling the application. In my case the tasks then consumed all the memory of the computing nodes and were killed.

- Including the option "-mcmodel=large" when compiling the application. This didn't work in my case.

- Adding an extra source file to the compilation: the file limit.c, which contains this code:

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

/* Called from Fortran as "CALL stacksize()"; the trailing underscore
   matches the default Fortran name mangling of ifort. */
void stacksize_()
{
    int res;
    struct rlimit rlim;

    /* Report the current soft (cur) and hard (max) stack limits. */
    getrlimit(RLIMIT_STACK, &rlim);
    printf("Before: cur=%d,hard=%d\n", (int)rlim.rlim_cur, (int)rlim.rlim_max);

    /* Raise both limits to "unlimited" for this process. */
    rlim.rlim_cur = RLIM_INFINITY;
    rlim.rlim_max = RLIM_INFINITY;
    res = setrlimit(RLIMIT_STACK, &rlim);

    /* Report the result: res=0 means setrlimit succeeded. */
    getrlimit(RLIMIT_STACK, &rlim);
    printf("After: res=%d,cur=%d,hard=%d\n", res, (int)rlim.rlim_cur, (int)rlim.rlim_max);
}
Additionally, you have to add "limit.o" to the end of the SOURCE variable in the Makefile, and

Code: Select all

limit.o: limit.c 
	icc -c -Wall -O2 limit.c
at the end of the Makefile (I used icc, but it might be gcc in other cases), and

Code: Select all

CALL stacksize()
in the file main.F just after the section

Code: Select all

!===========================================
! initialise / set constants and parameters ...
!===========================================
This was the option that worked for me both in vasp 5.2.12 and 5.3.2.
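
Since the patched binary prints the stack limits at startup (the two printf calls in limit.c), a simple way to confirm that the change is active is to look for those lines in the job's standard output; here I assume the output is captured in a file such as simul.log:

Code: Select all

# RLIM_INFINITY is printed as -1 when cast to int, so after the setrlimit
# call both cur and hard should read -1 (or a very large value)
grep -E "Before:|After:" simul.log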

Just for information, these are the different compiler/libraries I used:

Intel Compilers XE2013 (V13.0)
Intel MKL 11.0
Intel MPI 4.1
VASP was compiled with BLACS, LAPACK, ScaLAPACK and FFTW from MKL.

Regards,

Ivan

[SOLVED] VASP 5 crashes when using several computing nodes (large memory)

Posted: Tue Nov 06, 2012 2:16 pm
by ivasan
Dear all,

I have found that the previous fix does not actually work. Now ALL the simulations crash at the end, although they start and apparently run without problems.

The error that I get is

Code: Select all

APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)

Since I am studying systems with many atoms, I think that I have the same problem as the one posted here http://cms.mpi.univie.ac.at/vasp-forum/ ... hp?3.12143.

There must be something wrong with VASP when using a large amount of memory.
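
A job that ends with "Killed (signal 9)" while using a lot of memory is often terminated by the kernel OOM killer or by a batch-system memory limit; a quick way to check this (a sketch, assuming one can log in to the compute nodes after the crash) is:

Code: Select all

# Look for out-of-memory killer entries in the kernel log around the crash time
dmesg | grep -i -E "out of memory|oom"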

Regards,

Ivan


[SOLVED] VASP 5 crashes when using several computing nodes (large memory)

Posted: Mon Jan 21, 2013 7:46 pm
by ivasan
Dear all,

I am still working on this topic.

My latest findings are related to registerable-memory limits. I found that our InfiniBand switch has a limit on the maximum amount of registerable memory. In our case we have a Mellanox switch, and people from Mellanox recommend setting the value of

Code: Select all

(2^log_num_mtt)*(2^log_mtts_per_seg)*PAGE_SIZE
to at least twice the physical memory available on the nodes (link1, link2). You can check the values of these parameters with:

Code: Select all

getconf PAGE_SIZE
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
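As a quick sanity check, the formula above can be evaluated directly from those parameters (a small sketch assuming bash and the mlx4_core paths shown above):

Code: Select all

PAGE=$(getconf PAGE_SIZE)                                      # e.g. 4096
MTT=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)        # e.g. 24
SEG=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)   # e.g. 3
# (2^log_num_mtt)*(2^log_mtts_per_seg)*PAGE_SIZE reported in GiB; with 24, 3
# and 4096 this gives 512 GiB, which should be at least twice the node's RAM
echo "Max registerable memory: $(( (1 << MTT) * (1 << SEG) * PAGE / 1024**3 )) GiB"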
Mellanox people only recommend changing log_num_mtt. To do it, you have to edit the file /etc/modprobe.conf and add the line "options mlx4_core log_num_mtt=24" at the end of the file. Then you have to restart the InfiniBand network by doing the following on all the nodes of the cluster:

Stop opensm service: /etc/init.d/opensmd stop
Restart IB: /etc/init.d/openibd restart
Start opensm: /etc/init.d/opensmd start
Check the changes: cat /sys/module/mlx4_core/parameters/log_num_mtt

BUT: the problem is still there.

I have found a useful presentation (link3) that recommends using chunks when sending large messages through the network. In my case I am trying to relax Si systems with ~220 atoms, with a 4x4x4 MP k-point mesh, which requires a huge amount of memory, so I guess that large messages are being sent as well.

How does VASP send large messages?

Cheers,

Ivan

[SOLVED] VASP 5 crashes when using several computing nodes (large memory)

Posted: Wed Jan 30, 2013 8:50 pm
by ivasan
Dear all,

It seems that the modifications I previously posted solve the problem in most cases.

However, I am still facing the problem for some particular cases.

Regards,

Ivan

[SOLVED] VASP 5 crashes when using several computing nodes (large memory)

Posted: Thu Jan 31, 2013 12:52 pm
by askhetan
Dear ivasan,

I wish I had seen your thread earlier, because I am facing very similar problems when I go to higher KPOINTS. I would like to share how my jobs crash. For a reasonable (3x3 cell) metal-oxide system (~100 atoms) with a 3x3x1 Monkhorst-Pack k-point mesh, the run goes until the final electronic minimization step of the very first ionic step; then I noticed in the top output that each process on the cluster suddenly requires over 2.5 times the memory it usually takes during the electronic minimization steps. Eventually the job crashes with SIGTERM(78), which essentially means that one of the nodes crashed and subsequently the whole job died.
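
For reference, this is roughly how the memory usage from top can be logged during a run (the binary name "vasp", the log file name and the 30-second interval are just placeholders):

Code: Select all

# Append a snapshot of the VASP processes' memory usage every 30 seconds
while true; do
    date
    top -b -n 1 | grep -i vasp
    sleep 30
done >> top_vasp.log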

And with 4x4x1 KPOINTS, the jobs don't even start on the first electronic step.

I also just saw your post about the silicon case, and somehow I think that for both of us the problem lies in the memory, which seems to scale absurdly with the system size.
I will post my err file below.