
I'm trying to implement a partitioned SpGEMM (sparse matrix-matrix multiplication) algorithm on a multi-socket system. The goal is to distribute the multiplication work across the sockets and restrict memory accesses to the local socket only, so each socket gets its full local memory bandwidth.

[figure: partitioned SpGEMM]

The machine I'm using is a two-socket Intel Skylake system with 24 cores per socket. My idea is to use nested parallel regions with 2 threads in the outer region; when each outer thread reaches its section block, it spawns 24 inner threads and performs its half of the partitioned SpGEMM.

    omp_set_nested(1);
    omp_set_dynamic(0);

    for (int i = 0; i < ITERS; ++i) {
        start = omp_get_wtime();

        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {
                SpGEMM(A_upper, B, C, 24);   // intended to run on socket 0 with 24 threads
            }
            #pragma omp section
            {
                SpGEMM(A_lower, B, C, 24);   // intended to run on socket 1 with 24 threads
            }
        }

        end = omp_get_wtime();
        ave_msec += (end - start) * 1000 / ITERS;
    }
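
For completeness, the same placement intent can also be requested with `proc_bind` clauses on the constructs themselves, instead of relying only on the environment variables below. This is just a sketch of the idea; it assumes SpGEMM() opens its own inner parallel region with num_threads(24), which is how my code is structured:

    // Sketch: same structure with explicit proc_bind clauses instead of relying
    // only on OMP_PROC_BIND=spread,close.  Assumes SpGEMM() itself opens an
    // inner "#pragma omp parallel num_threads(24) proc_bind(close)" region.
    #pragma omp parallel sections num_threads(2) proc_bind(spread)
    {
        #pragma omp section
        SpGEMM(A_upper, B, C, 24);   // intended: outer thread 0 -> socket 0
        #pragma omp section
        SpGEMM(A_lower, B, C, 24);   // intended: outer thread 1 -> socket 1
    }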

Here are my environment variables and the command line I use to run the program:

    export OMP_PLACES=sockets
    export OMP_PROC_BIND=spread,close
    export OMP_NESTED=True
    export OMP_MAX_ACTIVE_LEVELS=2

    # run the program
    numactl --localalloc ./partitioned_spgemm
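
To check where the nested threads actually land, I use a small probe along these lines (a sketch, not my real code; it simply prints the core each inner thread runs on via sched_getcpu()):

    /* affinity_probe.c -- sketch only: report which core each nested thread
       runs on under the settings above.  Build: gcc -fopenmp affinity_probe.c */
    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);     /* same runtime settings as the real program */
        omp_set_dynamic(0);

        #pragma omp parallel num_threads(2)          /* outer level: one thread per socket */
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(24)     /* inner level: 24 threads per socket */
            {
                #pragma omp critical
                printf("outer %d / inner %2d -> core %3d\n",
                       outer, omp_get_thread_num(), sched_getcpu());
            }
        }
        return 0;
    }

With the settings above, each inner team should report cores from a single, distinct socket.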

After looking into the material linked in the comments, I now have thread affinity set up correctly, but the performance is still worse than I would expect.

Computing C_upper = A_upper * B or C_lower = A_lower * B on a single socket yields about 700 MFLOPS (floating-point operations per second). The original SpGEMM C = A * B across both sockets yields about 900 MFLOPS (as you may know, this is far below 2 * 700 MFLOPS because of remote NUMA accesses). With proper thread affinity I expected the partitioned SpGEMM to approach 1400 MFLOPS, but I only get 400 MFLOPS with my current setup.
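
For reference, the MFLOPS figures above are derived from the timing loop roughly like this (a sketch; `flops` stands for however the useful floating-point operations of one SpGEMM call are counted, which isn't shown here):

    /* sketch: converting the averaged time into a rate */
    double mflops = (double)flops / (ave_msec * 1e3);   /* ave_msec is in milliseconds */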

I'm using GCC 8.2.0 with OpenMP 4.5 on Red Hat Enterprise Linux Server 7.6. The SpGEMM implementation is outer-product based and uses OpenMP to parallelize its outer loop.
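
My real kernel is more elaborate than this, but the expansion phase of an outer-product SpGEMM with an OpenMP-parallelized outer loop looks roughly like the sketch below (all type and field names are made up for the example; A is assumed to be stored by column and B by row, and the later sort/merge of the partial products into C is omitted):

    /* Sketch of the expansion phase of an outer-product SpGEMM with the outer
       (k) loop parallelized by OpenMP.  Every partial product a(i,k)*b(k,j) is
       emitted as a triplet; a later sort/merge step (not shown) compresses the
       triplets into C.                                                        */
    #include <omp.h>
    #include <stdlib.h>

    typedef struct { int *colptr, *rowidx; double *val; int ncols; } CSC;  /* A by column */
    typedef struct { int *rowptr, *colidx; double *val; int nrows; } CSR;  /* B by row    */
    typedef struct { int row, col; double val; } Triplet;

    long spgemm_expand(const CSC *A, const CSR *B, Triplet **out, int nthreads)
    {
        int K = A->ncols;
        long *offset = malloc((K + 1) * sizeof *offset);   /* output offset per k */

        offset[0] = 0;
        for (int k = 0; k < K; ++k) {
            long nnz_a = A->colptr[k + 1] - A->colptr[k];
            long nnz_b = B->rowptr[k + 1] - B->rowptr[k];
            offset[k + 1] = offset[k] + nnz_a * nnz_b;
        }
        long total = offset[K];
        Triplet *buf = malloc(total * sizeof *buf);

        /* Outer loop over k: each k writes a disjoint slice of buf, so no races. */
        #pragma omp parallel for schedule(dynamic, 64) num_threads(nthreads)
        for (int k = 0; k < K; ++k) {
            long p = offset[k];
            for (int ia = A->colptr[k]; ia < A->colptr[k + 1]; ++ia)       /* column k of A */
                for (int ib = B->rowptr[k]; ib < B->rowptr[k + 1]; ++ib) { /* row k of B    */
                    buf[p].row = A->rowidx[ia];
                    buf[p].col = B->colidx[ib];
                    buf[p].val = A->val[ia] * B->val[ib];
                    ++p;
                }
        }

        free(offset);
        *out = buf;
        return total;   /* number of partial products written */
    }

Each value of k writes to a disjoint slice of the output buffer, so the outer loop needs no locking; the question is really about where those buffers and the matrix arrays end up relative to the threads that touch them.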

I think it must have something to do with the nested threads, but I cannot figure it out myself. How can I solve this?

Ernie
  • I think there are several questions tagged with both [tag:openmp] and [tag:numa] that address this. See, perhaps, [this](https://stackoverflow.com/questions/50090515/spreading-openmp-threads-among-numa-nodes) question or [this](https://stackoverflow.com/questions/7703069/how-to-make-openmp-thread-or-task-run-on-a-certain-core) one. – 1201ProgramAlarm Dec 19 '19 at 16:21
  • How does SpGEMM pin its threads? Does it also use OpenMP? – 648trindade Dec 19 '19 at 17:25
  • You also need to say which OpenMP compiler you are using, and which OS... – Jim Cownie Dec 20 '19 at 09:14
  • Thank you guys for the comments. I've looked into some other questions and answers; the thread affinity is now correct, but the performance is still worse than I'd expect. Could you take another look? – Ernie Dec 24 '19 at 03:37
  • Try with `OMP_PROC_BIND="spread,close"` so that the first-level threads are spread across the sockets and the second-level teams each stay on their own socket. – Gilles Dec 24 '19 at 07:30
  • @Gilles Thanks, the threads are properly spread by socket now, but the performance is still not there. I've seen some people mention that nested thread parallelism performs poorly; is this true? [reference](https://stackoverflow.com/questions/47662767/is-there-a-difference-between-nested-parallelism-and-collapsed-for-loops) – Ernie Dec 25 '19 at 01:44
  • What processor model are you using exactly? From your loose description I would expect a processor capable of at least 100 GFLOPS on one socket, but you're only getting 0.7. Can you also tell us what motherboard (or server model) you have, what speed your memory is running at, and the total data size? – John Zwinck Dec 25 '19 at 03:09
  • @JohnZwinck I'm using an Intel(R) Xeon(R) Platinum 8160 CPU. Sparse matrix-matrix multiplication is purely memory-bound (1 GFLOPS is normal); maximizing bandwidth utilization is actually what I've been working on. On a single socket with only local memory accesses I now get around 45 GB/s of bandwidth, which is very close to the result (50 GB/s) from the [stream benchmark](https://www.cs.virginia.edu/stream/). But with two sockets we have cross-socket memory accesses, and the bandwidth only reaches around 70 GB/s. So I want to do this partitioning to eliminate cross-socket memory access. – Ernie Dec 25 '19 at 03:28
  • How many MFLOPS do you get if you use just one core? I'd like to understand if one core is able to saturate the memory bandwidth (of one NUMA node). And how many MFLOPS if you use one core on each socket at the same time? – John Zwinck Dec 25 '19 at 04:02
  • 45 GB/s memory bandwidth should give you 1.5 GFLOPS (two 8-byte doubles in, one out), yet somehow you only get half that. So I still wonder if there's room to improve your single-socket performance. – John Zwinck Dec 25 '19 at 04:10
  • Did you ever make progress on this? – John Zwinck May 30 '20 at 06:37

0 Answers