
I have just begun learning MPI, so I bought 3 VPS instances to set up an experimental environment. I successfully installed and configured SSH and MPICH. The three nodes can SSH to each other (but not to themselves) without a password, and the cpi example runs without any problem on the local machine. However, when I try to run it across all 3 nodes, the cpi program always exits with the error Fatal error in PMPI_Reduce: Unknown error class, error stack:. Here is a full description of what I did and what the error said.

[root@fire examples]# mpiexec -f ~/mpi/machinefile  -n 6 ./cpi
Process 3 of 6 is on mpi0
Process 0 of 6 is on mpi0
Process 1 of 6 is on mpi1
Process 2 of 6 is on mpi2
Process 4 of 6 is on mpi1
Process 5 of 6 is on mpi2
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1263)...............: MPI_Reduce(sbuf=0x7fff1c18c440, rbuf=0x7fff1c18c448, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(826)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 1
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 2
MPIR_Reduce_intra(846)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(250).......: Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1563 RUNNING AT mpi0
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@mpi2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@mpi2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@mpi2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@mpi1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@mpi1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@mpi1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@mpi0] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@mpi0] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@mpi0] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@mpi0] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

I have no clue what happened. Any insights? As the comments suggest, here is the cpi source code.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f(double);

double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int main(int argc,char *argv[])
{
    int    n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int    namelen;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);

    fprintf(stdout,"Process %d of %d is on %s\n",
    myid, numprocs, processor_name);
    fflush(stdout);

    n = 10000;          /* default # of rectangles */
    if (myid == 0)
        startwtime = MPI_Wtime();

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h   = 1.0 / (double) n;
    sum = 0.0;
    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", endwtime-startwtime);         
        fflush(stdout);
    }

    MPI_Finalize();
    return 0;
}
spiritsaway
  • There is no possible way to fix the error without the source code. – m0nhawk Apr 01 '15 at 14:19
  • I just used the cpi example that comes with the MPICH package. – spiritsaway Apr 02 '15 at 15:41
  • I searched the MPICH mailing list and found someone with the same problem, but the only reply implied it might be a network problem. Maybe I should set up two physical machines to continue my experiment. – spiritsaway Apr 02 '15 at 15:48
  • 1
    Actually, this can be because of the network connection problem and when some nodes have MPICH instead of OpenMPI and vice verse. First, I'd recommend to check that all have the same environment and then try some simpler example. – m0nhawk Apr 02 '15 at 15:53
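Following up on that suggestion, a simpler point-to-point test can show whether ranks on different hosts can exchange messages at all, independent of the MPI_Reduce collective in cpi. Below is a minimal sketch (the file name ring_test.c and the token value are arbitrary choices, not part of the original setup); it passes a token around a ring of ranks, so every host has to both send and receive across the network:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            printf("Run this test with at least 2 processes.\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 42;  /* arbitrary payload */
        /* start the ring: send to rank 1, then wait for the token
           to come back from the last rank */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Token travelled through all %d ranks and came back: %d\n",
               size, token);
    } else {
        /* receive from the previous rank and pass the token to the next one */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If this test also hangs or aborts with a communication error, the problem lies in the network or host-name setup rather than in cpi itself. It can be built and launched the same way as cpi, e.g. mpicc ring_test.c -o ring_test followed by mpiexec -f ~/mpi/machinefile -n 6 ./ring_test.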

2 Answers


It is probably too late, but I will provide my answer anyway. I encountered the same problem, and after some research I figured out the issue.

If your machinefile contains hostnames instead of IP addresses and the machines are connected on a local network, then you need a name server running locally as well; otherwise, change the entries in your machinefile from hostnames to IP addresses. Having the names only in /etc/hosts will not solve the issue.

This turned out to be my problem; once I changed the entries in the machinefile to IP addresses, it worked.
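For illustration, a machinefile based on IP addresses might look like the following; the 10.0.0.x addresses are placeholders for your nodes' actual private addresses, and the optional ":2" suffix is the Hydra machinefile syntax for the number of processes to place on each host:

10.0.0.1:2
10.0.0.2:2
10.0.0.3:2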

Regards GOPI

GOPI
  • GOPI, as far as I know, MPICH uses the hostnames from `-f machinefile` to SSH to all machines and start the helper processes (pmi_proxy) and the task processes. Each process then gets the current hostname (just like the [`hostname`](http://linux.die.net/man/1/hostname) command), converts it locally to an IP (like `hostname -i`), and passes that IP to all other processes with the help of PMI. All MPI processes then use these IP addresses to connect to each other. So you and spiritsaway could SSH to all hosts (using the names from `machinefile`), collect all `hostname -i` outputs, and ping all the IPs from all hosts (a sketch of such a check follows these comments). – osgx Apr 26 '15 at 19:31
  • Thank you @GOPI. This made all the difference! :-) – Lord Loh. Nov 09 '17 at 06:47
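Building on osgx's comment, a quick consistency check could look like the following; the host names mpi0/mpi1/mpi2 are taken from the question and the addresses you see will differ:

# from any one node, collect the IP address each host resolves for itself
ssh mpi0 hostname -i
ssh mpi1 hostname -i
ssh mpi2 hostname -i

# then, from every node, verify that each reported address is reachable
ping -c 1 <address reported above>

If a host resolves to a loopback address or to an address the other nodes cannot reach, that is consistent with the kind of communication error shown in the question.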

My cluster of four Raspberry Pis (Model Bs) had the same problem.

I had set up my copy of Raspbian to use "ufw" as the firewall and configured "ssh" to use an RSA key with a passphrase on each Raspberry Pi. It was not until I distributed the public key of each Pi to every other Pi (see ssh-copy-id) that I got past the above error message.

Mind you, it is a bit tedious to run ssh-agent and then ssh-add on each Raspberry Pi prior to running "mpiexec" (I have yet to find out whether pssh/parallel-ssh can help with the setup).
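As a rough sketch of that setup (the user name pi, the host name node1, and the key path are placeholders; repeat the copy step for every pair of nodes):

# copy this node's public key to another node
ssh-copy-id -i ~/.ssh/id_rsa.pub pi@node1

# before launching mpiexec, load the passphrase-protected key into an agent
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa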

Galen