
I have C++ code that uses MS MPI (through Boost MPI). I usually run it on a Windows HPC Pack cluster (12 nodes, each with 32 cores). It runs without problems on one, two, or four nodes. But when I try to use all 12 nodes, it runs for some time and then fails (every time, it has not succeeded once). The error message in the output looks like this:

job aborted:
[ranks] message

[0] process exited without calling finalize

[1-383] terminated

---- error analysis -----

[0] on XXXXX
Model.exe ended prematurely and may have crashed. exit code 0xc0000409

---- error analysis -----

The program's own output at the point of the error is not readable. It looks something like this:

A
A
s
A
s
s
e
s
r
s
t
s
A
e
i
e
A
r
o
r
A
t
n
t
s
A

i
A
A
A
i
f
o
s
o
A
s
A
n
s
A
A
A
a
s
s
n
s
A
s

A
s
s
s
i
A
A

s
A
A
s

Any suggestions on how to debug this would be great. Thanks.

yacc
user11594134
  • The exit code means you have a stack buffer overrun: https://stackoverflow.com/questions/23409809 – rtoijala Jun 03 '19 at 19:17
  • Also, the output looks suspiciously like "Assertion failure", though it's difficult to be sure. – rtoijala Jun 03 '19 at 19:20
  • Thanks @rtoijala, your comment is really helpful. The program runs fine on 4 nodes but fails on 12 nodes. It is likely a stack buffer overrun or something similar, but it is really difficult to figure out which part causes it. The output looks like an assertion failure, but I suspect all the MPI processes print the error message at the same time, which makes it unreadable. – user11594134 Jun 04 '19 at 14:49
  • You might want to try to get an MPI debugger like [this one](https://www.microsoft.com/en-us/download/details.aspx?id=48215) to work. Then you could find out where the error occurs. – rtoijala Jun 05 '19 at 06:45
  • Thanks so much @rtoijala. This is very useful information. Unfortunately, I do not know how to debug on 12 HPC nodes. So I debugged in a very naive way: commenting out the code part by part to see where the problem is. It turns out that one of the classes I gather using Boost MPI is too big (a lot of large arrays). I guess Boost MPI / MS MPI has some limit on this. I might need to change the single gather into a two-stage gather (first stage within each node, second stage across the nodes). Hope this works. Thanks so much for your help. Really appreciate it. – user11594134 Jun 05 '19 at 19:38
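
The two-stage gather mentioned in the last comment can be sketched without MPI, just to show the rank bookkeeping. This is a single-process simulation, not the poster's code: the names `RankLayout`, `layout_for`, `two_stage_gather`, and the two `direct_inputs_*` helpers are hypothetical, and the sketch assumes ranks are numbered contiguously per node (ranks 0..31 on node 0, 32..63 on node 1, and so on), which a real job does not guarantee. In real Boost MPI code the grouping would more likely come from `boost::mpi::communicator::split` with a color derived from the host name, and whether this avoids the crash at all is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where a rank sits, ASSUMING contiguous rank numbering per node.
struct RankLayout {
    int node;        // node index; would serve as the "color" when splitting
    int local_rank;  // position of the rank inside its node
    bool is_leader;  // true for the rank that collects its node's data
};

RankLayout layout_for(int rank, int cores_per_node) {
    assert(rank >= 0 && cores_per_node > 0);
    RankLayout l;
    l.node = rank / cores_per_node;
    l.local_rank = rank % cores_per_node;
    l.is_leader = (l.local_rank == 0);
    return l;
}

// Simulates both stages in one process. payload[r] stands in for the
// object that rank r would contribute.
// Stage 1: every node leader collects the payloads of its own node.
// Stage 2: the root collects one bundle per node from the leaders.
std::vector<std::vector<double>> two_stage_gather(
        const std::vector<double>& payload, int cores_per_node) {
    assert(cores_per_node > 0);
    const std::size_t per_node = static_cast<std::size_t>(cores_per_node);
    assert(payload.size() % per_node == 0);

    std::vector<std::vector<double>> bundles(payload.size() / per_node);
    for (std::size_t rank = 0; rank < payload.size(); ++rank) {
        const RankLayout l = layout_for(static_cast<int>(rank), cores_per_node);
        bundles[static_cast<std::size_t>(l.node)].push_back(payload[rank]);
    }
    return bundles;  // what the root holds after stage 2
}

// Number of other ranks whose data arrives at the root directly.
int direct_inputs_single_stage(int world_size) {
    return world_size - 1;
}

int direct_inputs_two_stage(int world_size, int cores_per_node) {
    const int nodes = world_size / cores_per_node;
    // The root is also leader of node 0: it hears from its own node's
    // other ranks in stage 1 and from the other leaders in stage 2.
    return (cores_per_node - 1) + (nodes - 1);
}
```

For the cluster in the question (12 nodes of 32 cores, so 384 ranks), `direct_inputs_single_stage(384)` is 383 and `direct_inputs_two_stage(384, 32)` is 42. The total amount of data reaching the root is unchanged, and each stage-two bundle is 32 times larger than a single payload, so this only helps if the limit being hit relates to the number of simultaneous senders rather than to the size of one message.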
