
When I call:

MPI_File_open(k_space_communicator, "TB_schro_BS.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &bs_out);
MPI_File_close(&bs_out);

k_space_communicator is a valid communicator: it is used in many other functions without problems. bs_out is declared as: MPI_File bs_out;

TB_schro_BS.dat is a file that is deleted before this call if it already exists.

I get no immediate errors or hangs (all processes make it out of this function), and the code appears to run perfectly until later in the program, where it hangs at random when I try to delete another object.

This only occurs when the number of processes is not a factor of the number of calculations I am doing. However, if I comment this line of code out, there are no hang-ups no matter how many processes I run (obviously fewer than the number of calculations).

The most confusing thing is that when I traced where the hang-ups occur, I found that they happen between the end of the object's destructor and the line immediately after the delete call, where there is no extra code.

Finally, there is one error, printed before the hang, that I couldn't quite decipher. It occurs one to three times per run for a 4-process run:

*** glibc detected *** mpiuf-nemo: corrupted double-linked list: 0x000000000192a640 ***

I am working in C++ using MPICH2 on Ubuntu. Sorry I cannot be more specific, as I cannot release details about the code, but I will be happy to try to answer any further questions you may have. Sorry if I have missed anything out. I am a little flustered by this problem.

rene
rmasp98

1 Answer


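You should figure out, probably with Valgrind, where that glibc-detected error about a corrupted double-linked list came from. That message means the heap's bookkeeping was overwritten, typically by a write past the end of an allocation or by a double free, and the failure often surfaces only at a later free or delete rather than at the line that did the damage.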

You've mentioned object creation/destruction but your example is using the C bindings. Do you mix the C bindings and the C++ bindings? Most C++ code uses the C bindings just fine.
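Since the hang only shows up when the process count does not divide the number of calculations, one thing worth checking is how the work is split when there is a remainder. This is a guess, as the code isn't shown. The sketch below is not taken from your program: `Range` and `block_range` are hypothetical names, and it only illustrates a split whose ranges never overlap and never run past the last item.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: half-open range [begin, end) of work items
// owned by one rank.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n_items across n_ranks so that range sizes differ by at most one.
// The first (n_items % n_ranks) ranks each receive one extra item.
Range block_range(std::size_t n_items, int n_ranks, int rank) {
    assert(n_ranks > 0 && rank >= 0 && rank < n_ranks);
    const std::size_t p = static_cast<std::size_t>(n_ranks);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t base = n_items / p;   // items every rank gets
    const std::size_t extra = n_items % p;  // leftover items to spread out
    const std::size_t begin = r * base + (r < extra ? r : extra);
    const std::size_t size = base + (r < extra ? 1 : 0);
    return {begin, begin + size};
}
```

With 10 calculations on 4 processes this gives ranks 0 and 1 three items each and ranks 2 and 3 two items each. If any per-rank buffer in the real code is sized with plain integer division while a loop uses a rounded-up count (or the other way round), that mismatch would only bite in the non-divisible case.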

Rob Latham
  • Valgrind or any other debugger is not really an option without a hell of a lot of work. This code is massive and for some reason has not had debuggers included. Don't ask me why... The program is a mix of a lot of old and new code, so I imagine that there is a mixture of languages, but it all worked before I added the part with the error – rmasp98 Apr 29 '14 at 13:25
  • 18 months later, are you still stuck? If Valgrind won't work, how about GCC's or Clang's -fsanitize=address? – Rob Latham Sep 02 '15 at 14:32
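If neither Valgrind nor a sanitizer build is practical, a low-tech fallback is to pad suspect buffers with sentinel values and check them at known points, for example just before the delete that hangs. The sketch below is only an illustration: `GuardedBuffer`, the guard length, and the sentinel value are all made up here. It catches only small overruns past the end of the buffer, and only when `intact()` is actually called.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical debugging aid: a buffer of doubles followed by a guard
// zone filled with a sentinel value. An overrun past the usable region
// lands in the guard zone instead of in the allocator's bookkeeping.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t n)
        : size_(n), data_(n + kGuard, 0.0) {
        for (std::size_t i = 0; i < kGuard; ++i) {
            data_[n + i] = kSentinel;
        }
    }

    double* data() { return data_.data(); }
    std::size_t size() const { return size_; }  // usable elements only

    // Returns true while no write has modified the guard zone.
    bool intact() const {
        for (std::size_t i = 0; i < kGuard; ++i) {
            if (data_[size_ + i] != kSentinel) return false;
        }
        return true;
    }

private:
    static constexpr std::size_t kGuard = 8;        // guard elements (arbitrary)
    static constexpr double kSentinel = -12345.678; // arbitrary marker value
    std::size_t size_;
    std::vector<double> data_;
};
```

Calling `intact()` after each phase of the calculation and printing the rank when it fails narrows down which step writes out of bounds, without needing a debugger attached to every process.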