
I am new to MPI and I coded up the following simple program to perform numerical integration with the trapezoidal rule, using Open MPI on Ubuntu 10.10. Here is the code:

#include <iostream>
#include <mpi.h>
#include <cstdlib>
#include <cstdio>

//function to integrate
double f (double x )
{
  return 4.0/(1+x*x);
}


//function which integrates the function defined above on the interval [local_a, local_b] for the given refinement parameters local_n and h
double Trap(double local_a , double local_b, int local_n , double h)
{

  double integral ;
  double x;

  integral = ( f(local_a) + f(local_b) )/2.0;

  x = local_a ;

  for (int i = 1; i < local_n - 1; ++i)
    {
      x        += h; 
      integral += f(x);
    }

  integral *= h;
  return integral;
}

int main(int argc, char *argv[])
{

  int my_rank;
  int p;
  double a = 0.0;
  double b = 1.0;
  int n = atoi(argv[1]);//number of subdivisions of the interval
  double h;
  double local_a;
  double local_b;
  int local_n;

  double integral;
  double total;
  int source;
  int dest = 0;
  int tag = 0;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD,&p);//get number of processes
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);//get rank

  double start , finish;
  MPI_Barrier(MPI_COMM_WORLD);
  start = MPI_Wtime();  

////////////////////////////////////////////////////////////////////////////////////////////////////
  h = (b-a)/n;
  local_n = n/p;

  local_a = a + my_rank*local_n*h;
  local_b = local_a + local_n*h;

  integral = Trap(local_a , local_b , local_n , h);

  if (my_rank==0)
    {
      total = integral;

      for (source = 1; source < p; ++source)
        {
          MPI_Recv(&integral, 1, MPI_DOUBLE , source , tag , MPI_COMM_WORLD, &status );
          total += integral;
        }
    }
  else
    {
      MPI_Send(&integral, 1, MPI_DOUBLE, dest, tag , MPI_COMM_WORLD);
    }

  if (my_rank == 0)
    {
      printf("With n=%d trapezoids our estimate \n", n );
      printf("Of the integral from %f to %f = %f \n" , a ,b , total);

    }

   ////////////////////////////////////////////////////////////////////////////////////////////////////
  MPI_Barrier(MPI_COMM_WORLD);
  finish = MPI_Wtime();

  if(my_rank == 0)  std::cout << "Time taken is " << finish - start << std::endl ; 

  MPI_Finalize();
  return 0;
}

The function being integrated is f(x) = 4/(1+x^2), which when integrated on [0,1] gives pi = 3.14159...

Now when I run the program with different numbers of processes I get different answers, and the difference is quite significant, as you can see below.

Desktop: mpirun -np 1 ./a.out 50000
With n=50000 trapezoids our estimate 
Of the integral from 0.000000 to 1.000000 = 3.141553 
Time taken is 0.000718832

Desktop: mpirun -np 2 ./a.out 50000
With n=50000 trapezoids our estimate 
Of the integral from 0.000000 to 1.000000 = 3.141489 
Time taken is 0.000422001

Desktop: mpirun -np 3 ./a.out 50000
With n=50000 trapezoids our estimate 
Of the integral from 0.000000 to 1.000000 = 3.141345 
Time taken is 0.000365019

Desktop: mpirun -np 4 ./a.out 50000
With n=50000 trapezoids our estimate 
Of the integral from 0.000000 to 1.000000 = 3.141362 
Time taken is 0.0395319
smilingbuddha

2 Answers


You've got 2 different problems in your code:

1. The integration bounds depend on the number of MPI processes, and they are wrong when p does not divide n. Namely, the upper bound of the last process is

a + p * int(n/p) * (b-a)/n

which is different from b. I expect this to be the most important error in your code (unless there's another bug I haven't seen). A sketch of one possible fix follows after point 2.

2. Floating-point addition is not associative. The result of your parallel algorithm, which is aggregated from partial sums, will thus depend on the number of partial sums.
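A minimal sketch of one way to repair the decomposition from point 1, reusing the variable names from the question; the remainder-handling scheme (giving the first n % p ranks one extra subinterval) is my own suggestion rather than something from the answer, but it makes the union of the local intervals exactly [a, b]:

// Give the first (n % p) ranks one extra subinterval so that all n
// subintervals are assigned and the last rank ends exactly at b.
int local_n = n/p + (my_rank < n % p ? 1 : 0);
// Number of subintervals owned by all lower-numbered ranks.
int offset  = my_rank*(n/p) + (my_rank < n % p ? my_rank : n % p);
double local_a = a + offset*h;
double local_b = local_a + local_n*h;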

François Févotte

When doing floating-point arithmetic the order of operations matters. In real arithmetic (I mean the arithmetic of real numbers) a+b+c+d == a+c+d+b (and any other ordering of the additions). This is not necessarily true for floating-point arithmetic. Since MPI doesn't guarantee to do reductions from M processors to 1 processor in the same order every time, its floating-point behaviour is non-deterministic, at least as far as most of us need be concerned.
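As a small illustration (my own example, not from the answer), merely regrouping three double-precision additions changes the result:

#include <cstdio>

int main()
{
  double r1 = (0.1 + 0.2) + 0.3; // prints 0.60000000000000009
  double r2 = 0.1 + (0.2 + 0.3); // prints 0.59999999999999998
  std::printf("%.17g\n%.17g\n", r1, r2);
  return 0;
}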

Putting that to one side, the differences between the results on the varying numbers of processors do look rather larger than I would expect. Looking at your code, I think that you use integer arithmetic in this line:

local_n = n/p;

which leads to small parts of the total area not being assigned to any of the processes for calculation. By my reckoning the line

local_b = local_a + local_n*h;

does not set local_b to 1.0 for the last process.
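For example, with the n=50000, p=3 run shown in the question: h = (1-0)/50000 = 0.00002 and local_n = 50000/3 = 16666 (integer division), so the three processes between them cover only 3*16666 = 49998 of the 50000 subintervals. The last process (rank 2) gets local_a = 2*16666*0.00002 = 0.66664 and local_b = 0.66664 + 16666*0.00002 = 0.99996, so the strip from 0.99996 to 1.0 is never integrated.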

High Performance Mark