
I am trying to write an MPI application to speed up a math algorithm on a computer cluster. Before doing that, I am running some benchmarks, but the first results are not as good as expected.

The test application shows linear speedup up to 4 cores, but using 5 or 6 cores does not speed it up any further. I am testing on an Odroid N2 platform, which has 6 cores; nproc reports 6 available cores.

Am I missing some kind of configuration? Or is my code not well prepared (it is based on one of the basic MPI examples)?

Is there some response time or synchronization overhead that should be taken into account?

Here are some measurements from my MPI-based application. I measured the total calculation time of one function.

  • 1 core: 0.838052 s
  • 2 cores: 0.438483 s
  • 3 cores: 0.405501 s
  • 4 cores: 0.416391 s
  • 5 cores: 0.514472 s
  • 6 cores: 0.435128 s
  • 12 cores (4 cores from each of 3 N2 boards): 0.06867 s
  • 18 cores (6 cores from each of 3 N2 boards): 0.152759 s

I did a benchmark on a Raspberry Pi 4 with 4 cores:

  • 1 core: 1.51 s
  • 2 cores: 0.75 s
  • 3 cores: 0.69 s
  • 4 cores: 0.67 s

And this is my benchmark application:

int MyFun(int *array, int num_elements, int j)
{
  int result_overall = 0;

  for (int i = 0; i < num_elements; i++)
  {
    result_overall += array[i] / 1000;
  }
  return result_overall;
}

int compute_sum(int *sub_sums, int num_of_cpu)
{
  int sum = 0;
  for (int i = 0; i < num_of_cpu; i++)
  {
    sum += sub_sums[i];
  }
  return sum;
}

//measuring performance from main(): num_elements_per_proc is equal to 604800
  if (world_rank == 0)
  {
    startTime = std::chrono::high_resolution_clock::now();
  }
  // Compute the sum of your subset
  int sub_sum = 0;
  for (int j = 0; j < 1000; j++)
  {
    sub_sum += MyFun(sub_intArray, num_elements_per_proc, world_rank);
  }

  MPI_Allgather(&sub_sum, 1, MPI_INT, sub_sums, 1, MPI_INT, MPI_COMM_WORLD);

  int total_sum = compute_sum(sub_sums, num_of_cpu);
  if (world_rank == 0)
  {
    elapsedTime = std::chrono::high_resolution_clock::now() - startTime;
    timer = elapsedTime.count();
  }

I build it with the -O3 optimization level.
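
For reference, below is a minimal, self-contained sketch of one way to time the same region so that every rank starts together and the slowest rank determines the reported time. This is not the actual application: the file name and the dummy array contents are made up, and MyFun is copied from above; MPI_Barrier, MPI_Wtime and MPI_Reduce are standard MPI calls.

// timing_sketch.cpp (hypothetical file name) -- compile e.g. with:
//   mpic++ -O3 timing_sketch.cpp -o timing_sketch
#include <mpi.h>
#include <cstdio>
#include <vector>

// Same shape as the MyFun above; j (the rank) is unused in the computation.
int MyFun(const int *array, int num_elements, int j)
{
  int result_overall = 0;
  for (int i = 0; i < num_elements; i++)
  {
    result_overall += array[i] / 1000;
  }
  return result_overall;
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int world_rank = 0, num_of_cpu = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &num_of_cpu);

  const int num_elements_per_proc = 604800;            // same size as above
  std::vector<int> sub_intArray(num_elements_per_proc, world_rank + 1);  // dummy data
  std::vector<int> sub_sums(num_of_cpu, 0);

  MPI_Barrier(MPI_COMM_WORLD);                         // all ranks start together
  double t0 = MPI_Wtime();

  int sub_sum = 0;
  for (int j = 0; j < 1000; j++)
  {
    sub_sum += MyFun(sub_intArray.data(), num_elements_per_proc, world_rank);
  }
  MPI_Allgather(&sub_sum, 1, MPI_INT, sub_sums.data(), 1, MPI_INT, MPI_COMM_WORLD);

  double local_time = MPI_Wtime() - t0;
  double max_time = 0.0;                               // the slowest rank dominates
  MPI_Reduce(&local_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

  if (world_rank == 0)
  {
    printf("elapsed (max over ranks): %f s\n", max_time);
  }
  MPI_Finalize();
  return 0;
}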

UPDATE: new measurements:

  • 60480 samples, MyFun called 100000 times: 1.47 -> 0.74 -> 0.48 -> 0.36
  • 6048 samples, MyFun called 1000000 times: 1.43 -> 0.7 -> 0.47 -> 0.35
  • 6048 samples, MyFun called 10000000 times: 14.43 -> 7.08 -> 4.72 -> 3.59

UPDATE 2: By the way, when I list the CPU info in Linux I get this: [screenshot of lscpu output]

Is this normal? The quad-core A73 cluster is not shown, and it says there are two sockets with 3 cores each.

And here is the CPU utilization from sar: [screenshot of sar output]. It seems like all of the cores are utilized.

I created some plots of the speedup: [speedup plots]

It seems like calculating on float instead of int helps a bit, but cores 5 and 6 still do not help much. I also think the memory bandwidth is okay. Is this normal behavior when all cores are loaded equally on a big.LITTLE architecture?
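
As a rough back-of-envelope check on the bandwidth, using only the numbers already given above (so this is an estimate, not a measurement):

  604800 elements × 4 bytes ≈ 2.4 MB touched per rank in one MyFun call
  2.4 MB × 1000 calls ≈ 2.4 GB read per rank over the timed loop
  2.4 GB / 0.838 s ≈ 2.9 GB/s effective read rate in the single-core run

If nothing were served from cache, six ranks would therefore need roughly 17 GB/s in aggregate. Whether the SoC's shared memory path can sustain that is not something I have measured, and 2.4 MB per rank is small enough that caching changes the picture, so treat this only as an order-of-magnitude check.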

  • If `num_elements` is large, then your program is likely memory-bound and won't see much speed-up when adding new cores. – Hristo Iliev Jul 16 '20 at 19:59
  • You can perform a simple [roofline model](https://en.wikipedia.org/wiki/Roofline_model) analysis. Take the inner loop of `MyFun` and count how many cycles it takes to execute one iteration. Count how many bytes it reads from the memory (assuming `result_overall` is kept in a register). Divide the processor frequency by the cycle count and multiply by the number of bytes. Divide the maximum memory bandwidth by that amount and you'll get the theoretical limit of the number of independent copies of the loop that can run simultaneously without affecting each one's performance. – Hristo Iliev Jul 16 '20 at 20:24
  • Thanks for the comments. The total number of elements is 604800, which leads to 2.3 GB of RAM. The N2 has 4 GB of RAM; on the system the RAM is divided into two parts, about 3.5 GB of plain memory plus a 1 GB zram partition. Anyway, were you talking about cache-bound or RAM-bound? I will do a test with different sizes. – D_Dog Jul 17 '20 at 05:12
  • I'm talking about the bandwidth of the closest level of the memory hierarchy that is both shared between the cores and big enough to hold all of the data. If we are talking about 2.3 GB of data, that would obviously be the RAM. But 604800 integers actually take only a couple of MBs, so they might be able to fit in the L2 cache too. Make sure that you tell the MPI library to pin the ranks to the proper CPU cores since N2's SoC is big.LITTLE and mixing the two core clusters in a job that gives each rank the same amount of work is very inefficient. – Hristo Iliev Jul 17 '20 at 05:56
  • Oh sorry, yeah you are right. I miscalculated... it is 2.3 MB. But later I would like to use more data, up to 2-3 GB... And thanks for the suggestions, I will try those as well. – D_Dog Jul 17 '20 at 06:27
  • I added new measurements to my question. It seems like things improve when I decrease the array size. Next I will try fine-tuning the MPI usage as you suggested. – D_Dog Jul 17 '20 at 08:00
  • @HristoIliev I did some speedup measurements, and it seems like only the 5th and 6th MPI processes bring no further speedup. Do you think this is because of the equally distributed load on the big.LITTLE architecture? – D_Dog Jul 20 '20 at 08:46
  • Yes, the LITTLE cores are slowing everything down. If you want to fully utilise all 6 cores, you should employ a work distribution strategy suitable for heterogeneous architectures, e.g., one that distributes tasks dynamically or one that adjusts task size based on some performance measure. – Hristo Iliev Jul 20 '20 at 08:59
  • Regarding the output of `lscpu`, it obviously is not able to properly display the CPU topology, perhaps because it seeks to condense all the information in a single CPU entry. See the output of `cat /proc/cpuinfo` instead. – Hristo Iliev Jul 20 '20 at 09:01
  • Thank you very much for the input. It definitely makes more sense to me now. – D_Dog Jul 20 '20 at 11:21
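
Following the suggestion in the comments about a heterogeneous-friendly work distribution, below is a rough, self-contained sketch (not the actual application; chunk size, chunk count and the dummy data are made up) of a dynamic scheme in which rank 0 hands out chunk indices on demand, so the faster A73 cores simply end up processing more chunks than the A53 cores:

// dynamic_chunks.cpp (hypothetical) -- rough sketch of the dynamic distribution
// suggested above; chunk size, chunk count and the dummy data are made up.
// Rank 0 only coordinates and hands out chunk indices on demand, so faster
// (big) cores simply ask for work more often than the LITTLE ones.
// Assumes at least 2 ranks. Compile e.g. with: mpic++ -O3 dynamic_chunks.cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

static const int TAG_WORK = 1;
static const int TAG_STOP = 2;

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int chunk_elems = 60480;            // hypothetical chunk size
  const int num_chunks  = 100;              // hypothetical amount of total work
  std::vector<int> chunk(chunk_elems, 1);   // dummy data standing in for the real array

  long long local_sum = 0;

  if (rank == 0)                            // coordinator: hand out chunk indices
  {
    int next_chunk = 0, active_workers = size - 1;
    while (active_workers > 0)
    {
      int request;
      MPI_Status st;
      // a worker signals that it is idle with an empty request message
      MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (next_chunk < num_chunks)
      {
        MPI_Send(&next_chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        next_chunk++;
      }
      else
      {
        MPI_Send(&next_chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        active_workers--;
      }
    }
  }
  else                                      // worker: keep asking until told to stop
  {
    while (true)
    {
      int request = 0, chunk_id = 0;
      MPI_Status st;
      MPI_Send(&request, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
      MPI_Recv(&chunk_id, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP)
        break;
      // in the real code, chunk_id would select which part of the data to process
      for (int i = 0; i < chunk_elems; i++)
        local_sum += chunk[i] / 1000;
    }
  }

  long long total = 0;
  MPI_Reduce(&local_sum, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("total = %lld\n", total);

  MPI_Finalize();
  return 0;
}

Independently of the distribution scheme, pinning each rank to a fixed core is worth checking; with Open MPI, for example, `mpirun --bind-to core --report-bindings -np 6 ./app` enforces and prints the binding (the exact options depend on the MPI implementation and version).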
