Race condition issue in large-scale parallel jobs?
Posted: Fri Nov 14, 2008 6:15 pm
I was doing a scaling study of vasp.4.6.34 (5Dec07, gamma-only) on a BG/P, using a system of 144 ions (110592 plane waves, 240 bands) and going from 8 to 1024 processors.
Things went pretty well from 8 to 256 processors, with no major issues in running, but when I got to 512 and 1024, things stopped working (the code would start to run and then hang until it was killed or timed out).
Now one weird aspect was that in the 512-processor case, the stdout file and the OSZICAR output didn't match up:
stdout:
 POSCAR, INCAR and KPOINTS ok, starting setup
 WARNING: wrap around errors must be expected
 FFT: planning ... 1
 reading WAVECAR
 prediction of wavefunctions initialized - no I/O
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.134616312131E+04   -0.13462E+04   -0.15113E+05   240   0.804E+02
RMM:   2     0.678906973816E+02    0.14141E+04   -0.31181E+04   240   0.241E+02
RMM:   3    -0.373161624624E+03   -0.44105E+03   -0.87906E+03   240   0.156E+02
RMM:   4    -0.665909002304E+03   -0.29275E+03   -0.23943E+03   240   0.871E+01
RMM:   5    -0.724955435073E+03   -0.59046E+02   -0.57775E+02   240   0.439E+01
RMM:   6    -0.737867636598E+03   -0.12912E+02   -0.14283E+02   240   0.218E+01
RMM:   7    -0.740535632918E+03   -0.26680E+01   -0.38981E+01   240   0.117E+01
RMM:   8    -0.741178187819E+03   -0.64255E+00   -0.10515E+01   240   0.598E+00
RMM:   9    -0.741330045880E+03   -0.15186E+00   -0.34026E+00   593   0.345E+00
RMM:  10    -0.741360945514E+03   -0.30900E-01   -0.35481E-01   551   0.671E-01
RMM:  11    -0.741359545649E+03    0.13999E-02   -0.11029E-02   475   0.155E-01
RMM:  12    -0.741359528547E+03    0.17101E-04   -0.97017E-04   461   0.335E-02    0.450E+01
RMM:  13    -0.661558370867E+03    0.79801E+02   -0.28370E+02   512   0.187E+01    0.180E+01
RMM:  14    -0.652731303590E+03    0.88271E+01   -0.16616E+01   539   0.540E+00    0.111E+01
RMM:  15    -0.651277021638E+03    0.14543E+01   -0.53115E+00   482   0.407E+00    0.178E+00
RMM:  16    -0.651156842750E+03    0.12018E+00   -0.97604E-01   523   0.129E+00    0.856E-01
RMM:  17    -0.651150359979E+03    0.64828E-02   -0.62397E-02   505   0.360E-01    0.304E-01
RMM:  18    -0.651164434310E+03   -0.14074E-01   -0.37067E-02   481   0.308E-01    0.288E-01
RMM:  19    -0.651161916093E+03    0.25182E-02   -0.78862E-03   491   0.113E-01    0.473E-02
RMM:  20    -0.651162051035E+03   -0.13494E-03   -0.13765E-03   469   0.518E-02    0.504E-02
RMM:  21    -0.651161956873E+03    0.94162E-04   -0.13006E-04   326   0.177E-02
   1 T=  2000. E= -.61419337E+03 F= -.65116196E+03 E0= -.65116196E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.651138583236E+03   -0.65114E+03   -0.21660E+00   480   0.311E+00    0.455E-01
RMM:   2    -0.651132473309E+03    0.61099E-02   -0.31310E-02   542   0.367E-01    0.255E-01
RMM:   3    -0.651132118441E+03    0.35487E-03   -0.45908E-03   525   0.127E-01    0.991E-02
RMM:   4    -0.651132065402E+03    0.53039E-04   -0.74368E-04   476   0.401E-02
   2 T=  2000. E= -.61416348E+03 F= -.65113207E+03 E0= -.65113207E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00
Meanwhile the OSZICAR looked like:
N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.134616312131E+04   -0.13462E+04   -0.15113E+05   240   0.804E+02
RMM:   2     0.678906973816E+02    0.14141E+04   -0.31181E+04   240   0.241E+02
RMM:   3    -0.373161624624E+03   -0.44105E+03   -0.87906E+03   240   0.156E+02
RMM:   4    -0.665909002304E+03   -0.29275E+03   -0.23943E+03   240   0.871E+01
RMM:   5    -0.724955435073E+03   -0.59046E+02   -0.57775E+02   240   0.439E+01
RMM:   6    -0.737867636598E+03   -0.12912E+02   -0.14283E+02   240   0.218E+01
RMM:   7    -0.740535632918E+03   -0.26680E+01   -0.38981E+01   240   0.117E+01
RMM:   8    -0.741178187819E+03   -0.64255E+00   -0.10515E+01   240   0.598E+00
RMM:   9    -0.741330045880E+03   -0.15186E+00   -0.34026E+00   593   0.345E+00
RMM:  10    -0.741360945514E+03   -0.30900E-01   -0.35481E-01   551   0.671E-01
RMM:  11    -0.741359545649E+03    0.13999E-02   -0.11029E-02   475   0.155E-01
RMM:  12    -0.741359528547E+03    0.17101E-04   -0.97017E-04   461   0.335E-02    0.450E+01
RMM:  13    -0.661558370867E+03    0.79801E+02   -0.28370E+02   512   0.187E+01    0.180E+01
RMM:  14    -0.652731303590E+03    0.88271E+01   -0.16616E+01   539   0.540E+00    0.111E+01
RMM:  15    -0.651277021638E+03    0.14543E+01   -0.53115E+00   482   0.407E+00    0.178E+00
RMM:  16    -0.651156842750E+03    0.12018E+00   -0.97604E-01   523   0.129E+00    0.856E-01
RMM:  17    -0.651150359979E+03    0.64828E-02   -0.62397E-02   505   0.360E-01    0.304E-01
RMM:  18    -0.651164434310E+03   -0.14074E-01   -0.37067E-02   481   0.308E-01    0.288E-01
RMM:  19    -0.651161916093E+03    0.25182E-02   -0.78862E-03   491   0.113E-01    0.473E-02
RMM:  20    -0.651162051035E+03   -0.13494E-03   -0.13765E-03   469   0.518E-02    0.504E-02
RMM:  21    -0.651161956873E+03    0.94162E-04   -0.13006E-04   326   0.177E-02
   1 T=  2000. E= -.61419337E+03 F= -.65116196E+03 E0= -.65116196E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.651138583236E+03   -0.65114E+03   -0.21660E+00   480   0.311E+00    0.455E-01
RMM:   2    -0.651132473309E+03    0.61099E-02   -0.31310E-02   542   0.367E-01    0.255E-01
RMM:   3    -0.651132118441E+03    0.35487E-03   -0.45908E-03   525   0.127E-01    0.991E-02
RMM:   4    -0.651132065402E+03    0.53039E-04   -0.74368E-04   476   0.401E-02
So it appears that some of the processes exited the electronic loop for the 2nd ionic step, but the rest did not: stdout reports the summary line for ionic step 2, while the OSZICAR stops at the 4th electronic iteration of that step (and the OUTCAR doesn't show anything past that iteration either).
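In case it helps anyone reproduce the comparison, here is a minimal sketch of how the last point reached in each file can be extracted, rather than eyeballing the two listings. The function name `progress` and the two regular expressions are my own assumptions based on the line layout shown above (an `RMM:` or `DAV:` prefix for electronic iterations, and `<n> T=` for the ionic-step summary); they are not part of VASP.

```python
import re

# Ionic-step summary lines look like "   2 T=  2000. E= -.614E+03 F= ..."
IONIC_STEP = re.compile(r"^\s*(\d+)\s+T=")
# Electronic-iteration lines look like "RMM:   4    -0.651E+03 ..."
ELECTRONIC = re.compile(r"^\s*(?:RMM|DAV):\s*(\d+)\s")

def progress(lines):
    """Return (last completed ionic step, last electronic iteration seen after it)."""
    ionic_done = 0
    pending = 0
    for line in lines:
        m = IONIC_STEP.match(line)
        if m:
            ionic_done = int(m.group(1))
            pending = 0  # a summary line closes the current electronic loop
            continue
        m = ELECTRONIC.match(line)
        if m:
            pending = int(m.group(1))
    return ionic_done, pending

# The last two relevant lines of each listing above
stdout_tail = [
    "RMM:   4    -0.651132065402E+03    0.53039E-04   -0.74368E-04   476   0.401E-02",
    "   2 T=  2000. E= -.61416348E+03 F= -.65113207E+03 E0= -.65113207E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00",
]
oszicar_tail = [
    "   1 T=  2000. E= -.61419337E+03 F= -.65116196E+03 E0= -.65116196E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00",
    "RMM:   4    -0.651132065402E+03    0.53039E-04   -0.74368E-04   476   0.401E-02",
]

print(progress(stdout_tail))   # (2, 0)
print(progress(oszicar_tail))  # (1, 4)
```

On real runs the same function can be fed open file handles for the job's stdout file and the OSZICAR. A result like `(2, 0)` versus `(1, 4)` is the mismatch described above.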
Any ideas what could be going on here?
Any ideas on what to check to see whether a race condition is showing up, or is it as simple as spreading the system over too many processors (I see similar behavior with larger systems, so I doubt this)?
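To put rough numbers on the "spread out too far" idea, here is a back-of-the-envelope sketch of the average work per processor for this system. It only divides the totals quoted above by the processor count. The intermediate processor counts (powers of two between 8 and 1024) are an assumption, and the even split ignores how VASP actually distributes bands and plane waves (for example via NPAR), so treat it as an illustration only.

```python
N_BANDS = 240            # from the system description above
N_PLANE_WAVES = 110592   # from the system description above

# Assumed sweep: powers of two from 8 to 1024 (only the end points and
# 256/512 are stated explicitly in the post)
PROCESSOR_COUNTS = [8, 16, 32, 64, 128, 256, 512, 1024]

def work_per_processor(n_procs):
    """Average bands and plane waves per processor for an idealized even split."""
    return N_BANDS / n_procs, N_PLANE_WAVES / n_procs

for n in PROCESSOR_COUNTS:
    bands, plane_waves = work_per_processor(n)
    print(f"{n:5d} procs: {bands:8.3f} bands/proc, {plane_waves:9.1f} plane waves/proc")
```

By this crude measure the 256-processor run already has fewer than one band per processor (240/256 = 0.9375) and it ran without problems, which is consistent with doubting that thin distribution alone explains the hang.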