
I made a simple program that compares two images pixel by pixel and determines whether they are the same. I'm trying to adapt it to MPI, but I'm afraid the communication takes so long that the parallel version ends up far less efficient than its sequential counterpart. I have tried with very high-resolution images and the result is the same: the sequential code is more efficient than the parallel code. Is there a way of making it more efficient?

Sequential Code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    // Reads a 24-bit BMP and returns its pixel data, converted from BGR to RGB.
    // The total byte count is returned through *sizes.
    unsigned char* BMP(char* filename, int* sizes)
    {
        FILE* f = fopen(filename, "rb");
        unsigned char info[54];
        fread(info, sizeof(unsigned char), 54, f);   // 54-byte BMP header

        int ancho = *(int*)&info[18];   // width
        int alto  = *(int*)&info[22];   // height

        int size = 3 * ancho * alto;
        *sizes = size;
        unsigned char* data = new unsigned char[size];
        fread(data, sizeof(unsigned char), size, f);
        fclose(f);

        // Swap B and R so the buffer is in RGB order
        for (int i = 0; i < size; i += 3)
        {
            unsigned char tmp = data[i];
            data[i] = data[i + 2];
            data[i + 2] = tmp;
        }

        return data;
    }

    int main(int argc, char** argv)
    {
        int sizes, i, bol = 1;
        clock_t t1 = clock();

        unsigned char* data1 = BMP(argv[1], &sizes);
        unsigned char* data2 = BMP(argv[2], &sizes);

        // Note: stepping by 3 compares only one channel of each pixel
        for (i = 0; i < sizes; i += 3)
        {
            if (data1[i] != data2[i]) {
                printf("The images are not the same\n");
                bol = 0;
                break;
            }
        }

        if (bol == 1)
            printf("The images are the same\n");

        clock_t t2 = clock();
        double tiemp = ((double)(t2 - t1)) / CLOCKS_PER_SEC;
        printf("%f\n", tiemp);

        delete[] data1;
        delete[] data2;
        return 0;
    }

MPI counterpart:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <time.h>

    // Same BMP reader as in the sequential version
    unsigned char* BMP(char* filename, int* sizes)
    {
        FILE* f = fopen(filename, "rb");
        unsigned char info[54];
        fread(info, sizeof(unsigned char), 54, f);

        int ancho = *(int*)&info[18];   // width
        int alto  = *(int*)&info[22];   // height

        int size = 3 * ancho * alto;
        *sizes = size;
        unsigned char* data = new unsigned char[size];
        fread(data, sizeof(unsigned char), size, f);
        fclose(f);

        for (int i = 0; i < size; i += 3)
        {
            unsigned char tmp = data[i];
            data[i] = data[i + 2];
            data[i + 2] = tmp;
        }

        return data;
    }

    int main(int argc, char** argv)
    {
        int sizes, i, world_rank, world_size;
        clock_t t1 = clock();

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        unsigned char* data1 = NULL;
        unsigned char* data2 = NULL;
        int root = 0;
        if (world_rank == root) {
            data1 = BMP(argv[1], &sizes);
            data2 = BMP(argv[2], &sizes);
            printf("%d\n", sizes);
        }

        MPI_Bcast(&sizes, 1, MPI_INT, root, MPI_COMM_WORLD);
        int num_elements_por_proc = sizes / world_size;
        unsigned char* subdata1 = new unsigned char[num_elements_por_proc];
        unsigned char* subdata2 = new unsigned char[num_elements_por_proc];
        MPI_Scatter(data1, num_elements_por_proc, MPI_UNSIGNED_CHAR,
                    subdata1, num_elements_por_proc, MPI_UNSIGNED_CHAR,
                    root, MPI_COMM_WORLD);
        MPI_Scatter(data2, num_elements_por_proc, MPI_UNSIGNED_CHAR,
                    subdata2, num_elements_por_proc, MPI_UNSIGNED_CHAR,
                    root, MPI_COMM_WORLD);

        // Every rank, including root, compares its own chunk; the loop
        // bound must be < (not <=) to stay inside the buffer
        int bol = 0;
        for (i = 0; i < num_elements_por_proc; i++) {
            if (subdata1[i] != subdata2[i]) {
                bol = 1;
                break;
            }
        }

        int bolls;
        MPI_Reduce(&bol, &bolls, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);

        if (world_rank == root) {
            if (bolls != 0)
                printf("The images are not the same\n");
            else
                printf("The images are the same\n");

            clock_t t2 = clock();
            double tiemp = ((double)(t2 - t1)) / CLOCKS_PER_SEC;
            printf("%f\n", tiemp);
            delete[] data1;
            delete[] data2;
        }

        delete[] subdata1;
        delete[] subdata2;
        MPI_Finalize();
        return 0;
    }

2 Answers


This code is not suitable for parallelization. Your bottleneck is very likely just reading the file. Even if the file were already in the memory of the root process, sending the data out and then looking at each element (actually only a third of them) just once cannot be faster than doing it on the root process itself.

The only way to exploit parallelism here would be to store the files distributed, and read them distributed. Then you could for instance compute a hash on each node and compare those.

A few more remarks:

  • Consider using MPI_LOR (logical or) for the reduction instead of addition.
  • Use std::swap instead of a manual tmp variable.
  • Pair each new with a delete, even in example code.
  • Format your code properly, for your own sake and for the sake of the people who have to read it here. If you are lazy, use a tool like clang-format.
– Zulan

Besides being I/O-bound, as explained by @Zulan, your algorithm has a fundamental property that makes it unsuitable for parallelisation. To understand why, consider the following specifically constructed extreme case.

You have two images that differ only in their first (when linearised) pixel and are otherwise the same. You divide the image into N parts and distribute them to N ranks to compare. The first rank immediately finds a difference, breaks the loop, and enters the MPI_Reduce call, but the other N-1 ranks have to go over their entire iteration ranges before they reach the conclusion that their image parts are the same. MPI_Reduce is a collective operation and only completes once all participating ranks have called into it, in other words not before the N-1 ranks have fully examined their image segments.

The serial program will find the difference on the very first iteration and break the loop immediately. This is a clear case where the parallel version simply cannot be faster and is, on the contrary, considerably slower. It also illustrates load imbalance: different processes perform varying amounts of work, and the faster ones have to wait for the slower ones to complete. You could implement some kind of notification mechanism and have the first rank that finishes notify the others, so that they can break out of their comparison loops. This is far better suited to shared-memory systems with a paradigm such as OpenMP, although even there the cancellation mechanisms come at a cost.

At the other extreme, if the images are the same, the parallel program can run up to N times faster than the serial one (ignoring the communication overhead). If the images differ in their (length/N)-th pixel, the parallel version will take about the same amount of time as the serial one.

The parallel speed-up is thus quite unpredictable and very sensitive to the actual input.

– Hristo Iliev
  • Hello, and thank you for your response! The problem is that even with the same image and the same size, the parallel version is slower; I cannot even match the serial time. – Andrés Arámburo Jun 02 '16 at 19:51