What could make MPI_File_write_all fail with Floating point exception?

Question

I have a call to MPI_File_write_all:

double buf[100][100][100];
int data_size = 100*100*100;
MPI_Status stat_mpi;
MPI_File sgfh;

... 

MPI_File_write_all(sgfh, (void*)buf, data_size, MPI_DOUBLE, &stat_mpi);

The size of buf can vary, 100^3 is just an example. Under certain circumstances that I still don't have a complete handle on, the call to MPI_File_write_all fails with a floating-point exception. Everything I can test -- the buf array, value of data_size -- checks out OK.

Any idea what could cause this? I get the same error with Cray and gnu compilers, and regardless of optimization levels.

Sorry, I don't have a small example that reproduces the problem. Even stripped down to the bare essentials, the code would still be too big for this page.

score 2 · Accepted Answer · answered Oct 26 '16 at 02:39

The floating point exception most likely occurs when the two-phase collective buffering algorithm tries, because of a bug, to divide by zero. I've only seen that happen on Lustre when the stripe count is somehow incorrect.
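
The error name is misleading here: on x86 Linux, an integer division by zero is typically delivered as SIGFPE and reported as "Floating point exception", so no floating-point arithmetic needs to be involved. Below is a minimal sketch of how a zero stripe count could reach such a division and how a guard avoids it. The function `bytes_per_stripe` and its parameters are hypothetical illustrations and are not taken from ROMIO or from Cray's MPI-IO code.

```c
/* Hypothetical helper, NOT from ROMIO or Cray MPI: work out how many bytes
 * each stripe receives when total_bytes is spread over stripe_count stripes. */
long long bytes_per_stripe(long long total_bytes, int stripe_count)
{
    /* Without this guard, stripe_count == 0 would make the integer division
     * below undefined behavior; on x86 Linux it typically traps and the
     * process dies with "Floating point exception". */
    if (stripe_count <= 0) {
        return -1; /* report the bad stripe count to the caller instead */
    }

    /* Round up so that a final, partially filled stripe is counted. */
    return (total_bytes + stripe_count - 1) / stripe_count;
}
```

With the guard in place, `bytes_per_stripe(8000000, 0)` returns -1 instead of trapping, which is the kind of check that appears to be missing when the stripe count comes back wrong.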

You can verify this theory by disabling collective I/O. The easiest way with Cray MPI is to set the MPICH_MPIIO_HINTS environment variable:

export MPICH_MPIIO_HINTS='*:romio_cb_write=disable'
aprun ... your_program

Cray made the business decision to close-source their MPI-IO modifications to ROMIO. That choice is well within their rights, but it means I can only offer vague suggestions. You'll have to ask your Cray support contact for an actual bug fix.

Rob Latham

You nailed it, dude! I would have never figured it out! Thank you! – bob.sacamento Oct 26 '16 at 14:53
Disabling collective I/O might have some serious performance implications, but "slow" beats "crashes", right? – Rob Latham Oct 26 '16 at 18:17
Oh the performance becomes absolutely horrible when I disable collective I/O, but at least now I'm not going to spend days trying to track down a problem in the code that isn't ever there! – bob.sacamento Oct 26 '16 at 18:51
