
I'm trying to run the NAS-UPC benchmarks on a 32-node cluster.

It works fine when the problem size is small. When I move up to a bigger problem size (CLASS D), I get this error (for the MG benchmark):

*** Caught a fatal signal: SIGBUS(7) on node 2/32
 p4_error: latest msg from perror: Bad file descriptor
*** Caught a signal: SIGPIPE(13) on node 0/32
    p4_error: latest msg from perror: Bad file descriptor
   p4_error: latest msg from perror: Bad file descriptor

*** FATAL ERROR: recursion failure in AMMPI_SPMDExit
*** Caught a signal: SIGPIPE(13) on node 27/32
*** Caught a signal: SIGPIPE(13) on node 20/32
*** Caught a signal: SIGPIPE(13) on node 21/32
    p4_error: latest msg from perror: Bad file descriptor
*** FATAL ERROR: recursion failure in AMMPI_SPMDExit
*** FATAL ERROR: recursion failure in AMMPI_SPMDExit
*** FATAL ERROR: recursion failure in AMMPI_SPMDExit
*** Caught a signal: SIGPIPE(13) on node 16/32
*** FATAL ERROR: recursion failure in AMMPI_SPMDExit

Can anybody explain why this is happening? Has anyone seen this error before and fixed it?

EDIT: I figured out it is a memory-related problem, but I'm unable to allot the right amount of memory to the application at compile time.
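
(For context: NPB-derived suites normally fix the problem size, and hence the memory footprint, at build time via the problem class. The make invocation below follows the stock NPB convention; the exact variable names in the UPC port may differ, and the thread count may be set by a separate variable.)

    # assumed NPB-style build: the problem class chosen here decides memory use
    make mg CLASS=D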

Sharat Chandra

2 Answers


Check the dmesg output - this can be an out-of-memory issue. Or it can be that one of the limits from ulimit -a was hit, e.g. the stack size (the default stack size is too small for some NAS tasks).

If any of your machines has a line like "Out of Memory: Killed process ###" in its dmesg output, it means your program required (and tried to use) more memory than the OS could give it; the commands sketched after the list below show how to check. There are several limits on memory:

  1. ulimit -v - the user limit on virtual memory size. Check all the ulimit -a limits too, but it seems your case is not this one.
  2. You cannot use more memory than your total RAM plus all swap (check with the free command). And if your application uses more memory than the RAM size and begins to swap, performance will be bad in most cases.
  3. There are architectural limits on the maximum memory a single process may have. For 32-bit nodes this limit ranges from 1 GB (a very rare case) to 2, 3, or 4 GB. Even if your 32-bit system has more than 4 GB of memory, e.g. by using PAE, no single process can take more than 4 GB, and a big part of that 4 GB virtual address space is also taken by the OS (from hundreds of MB up to GBs).
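
A quick way to run these checks on a node after a failed run (the "Out of Memory" pattern below matches the classic kernel OOM-killer message; exact wording varies between kernel versions):

    # look for OOM-killer activity in the kernel log and system logs
    dmesg | grep -i "out of memory"
    grep -ri "killed process" /var/log/ 2>/dev/null

    # show all per-process resource limits (stack size, virtual memory, ...)
    ulimit -a

    # show total RAM and swap on this node, in megabytes
    free -m
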
osgx
  • I set the stack to unlimited (ulimit -s unlimited), but the problem persists. Is there any way I could counter this out-of-memory problem? – Sharat Chandra Mar 24 '11 at 00:03
  • @Sharat Chandra, did you check the `dmesg` output after a failure to verify whether it is actually an OOM (out of memory) or something else? – osgx Mar 24 '11 at 07:29
  • I did, but I do not know how to interpret the output. Could you tell me what I should look for in that output? You should kindly excuse my ignorance in this matter :( – Sharat Chandra Mar 24 '11 at 22:47
  • Is there a line like "Out of Memory: Killed process ..." in dmesg or in the system log (grep in `/var/log/*`)? – osgx Mar 25 '11 at 14:50

I figured out it is a problem with the benchmark needing more memory than I had allotted it at compile time.
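
The AMMPI messages in the error output suggest the Berkeley UPC runtime, whose launcher can usually raise the per-thread shared heap at run time instead; the -shared-heap option below is from Berkeley UPC's upcrun, but the heap size and the executable name mg.D.32 are illustrative assumptions you would need to adapt:

    # launch 32 UPC threads with a larger shared heap per thread
    # (512MB and the binary name are placeholders - tune for CLASS D)
    upcrun -np 32 -shared-heap 512MB ./mg.D.32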

Sharat Chandra
  • When you accept your own answers, they will not be moved to the top, and +2 reputation is not given. – osgx Apr 05 '11 at 21:00