I have the following code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

static int rank, size;

char msg[] = "This is a test message";

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 2) {
        fprintf(stderr, "This test requires exactly 2 tasks (has: %d).\n", size);
        MPI_Finalize();
        return -1;
    }

    int run = 1;
    if (argc > 1) {
        run = atoi(argv[1]);
    }

    int len = strlen(msg) + 1;
    if (argc > 2) {
        len = atoi(argv[2]);
    }

    char buf[len];

    strncpy(buf, msg, len);

    MPI_Status statusArray[run];

    MPI_Request reqArray[run];


    double start = MPI_Wtime();

    for (int i = 0; i < run; i++) {
        if (!rank) {
          MPI_Isend(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqArray[i]);
          printf("mpi_isend for run %d\n", i);
        } else {
          MPI_Irecv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqArray[i]);
          printf("mpi_irecv for run %d\n", i);
        }
    }
    int buflen = 512;
    char name[buflen];
    gethostname(name, buflen);
    printf("host: %s has rank %d\n", name, rank);
    printf("Reached here! for host %s before MPI_Waitall \n", name);
    if(!rank) {
      printf("calling mpi_waitall for sending side which is %s\n", name);
      MPI_Waitall(run, &reqArray[0], &statusArray[0]);
    }
    else {
      printf("calling mpi_waitall for receiving side which is %s\n", name);
      MPI_Waitall(run, &reqArray[0], &statusArray[0]);
    }
    printf("finished waiting! for host %s\n", name);
    double end = MPI_Wtime();
    if (!rank) {
      printf("Throughput: %.4f Gbps\n", 1e-9 * len * 8 * run / (end - start));
    }

    MPI_Finalize();
}

I get a segmentation fault on the sending side before MPI_Waitall. The error message is:

[host1:27679] *** Process received signal ***
[host1:27679] Signal: Segmentation fault (11)
[host1:27679] Signal code: Address not mapped (1)
[host1:27679] Failing at address: 0x8
[host1:27679] [ 0] /lib64/libpthread.so.0() [0x3ce7e0f500]
[host1:27679] [ 1] /usr/lib64/openmpi/mca_btl_openib.so(+0x21dc7) [0x7f46695c1dc7]
[host1:27679] [ 2] /usr/lib64/openmpi/mca_btl_openib.so(+0x1cbe1) [0x7f46695bcbe1]
[host1:27679] [ 3] /lib64/libpthread.so.0() [0x3ce7e07851]
[host1:27679] [ 4] /lib64/libc.so.6(clone+0x6d) [0x3ce76e811d]
[host1:27679] *** End of error message ***

I think there is something wrong with the array of MPI_Request. Could someone point it out? Thanks!

Wesley Bland
Ra1nWarden

1 Answer

I ran your program without a problem (other than a compiler warning because unistd.h, which declares gethostname, is not included). The problem is probably related to your Open MPI setup rather than to your code. Are you using a machine with an InfiniBand network? If not, you probably want to switch to the tcp transport, since the crash in your stack trace is inside the openib component (mca_btl_openib.so).

To restrict Open MPI to tcp (plus self, which handles a process sending to itself), run it like this:

mpirun --mca btl tcp,self -n 2 <prog_name> <prog_args>

That will ensure that openib isn't accidentally detected and used when it shouldn't be.

If, on the other hand, you do mean to use InfiniBand, you might have discovered some sort of problem with Open MPI. I doubt that's the case though since you're not doing anything fancy.

Wesley Bland
  • Yes, I am running on machines with an IB network and I do intend to use that interface. – Ra1nWarden Jan 24 '14 at 19:42
  • In that case, it's probably something else related to Open MPI. I retagged this question to add the Open MPI tag so hopefully one of those guys will come along soon and help. – Wesley Bland Jan 24 '14 at 21:17
  • You can also post your question to the Open MPI users mailing list (http://www.open-mpi.org/community/lists/ompi.php) if you don't get a response here. – Wesley Bland Jan 24 '14 at 21:17
  • From the stack trace, this appears to be either the async event thread or the service thread of the `openib` component. It could be a bug in the library or resource exhaustion (e.g. too many outstanding requests). – Hristo Iliev Jan 25 '14 at 00:56
  • Yeah, I actually wrote this to create congestion. The final goal is to look for traffic control packets using `ibdump`. Thank you for your help. :) – Ra1nWarden Jan 27 '14 at 03:56