I'm trying to implement a partitioned SpGEMM algorithm on a multi-socket system. The goal is to distribute the multiplication work across all sockets and keep each partition's memory accesses on its local socket, so that every socket runs at its best memory bandwidth.
The machine I'm using is a two-socket Intel Skylake system with 24 cores per socket, so my plan was to use nested parallel regions with 2 threads in the outer region: once an outer thread reaches its section block, it spawns a team of 24 threads and performs its half of the partitioned SpGEMM.
double start, end, ave_msec = 0.0;

omp_set_nested(1);   /* enable nested parallelism */
omp_set_dynamic(0);  /* keep team sizes exactly as requested */

for (int i = 0; i < ITERS; ++i) {
    start = omp_get_wtime();
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            /* upper half of C, intended for the team on socket 0 */
            SpGEMM(A_upper, B, C, 24);
        }
        #pragma omp section
        {
            /* lower half of C, intended for the team on socket 1 */
            SpGEMM(A_lower, B, C, 24);
        }
    }
    end = omp_get_wtime();
    ave_msec += (end - start) * 1000 / ITERS;
}
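For context, A_upper and A_lower are just the top and bottom halves of A's rows. A minimal sketch of how I build them is below (assuming a plain CSR container just for illustration; csr_t and csr_rows_slice are placeholder names and my real code, which uses the outer-product formulation described further down, differs in the details):

#include <stdlib.h>
#include <string.h>

typedef struct {           /* hypothetical plain CSR container */
    int     nrows, ncols;
    int    *rowptr;        /* nrows + 1 entries */
    int    *colidx;        /* nnz entries */
    double *val;           /* nnz entries */
} csr_t;

/* Copy rows [r0, r1) of A into a new CSR matrix.
 * Called from a thread pinned to the target socket so that the
 * copies are first-touched (and thus allocated) in local memory. */
static csr_t csr_rows_slice(const csr_t *A, int r0, int r1)
{
    csr_t S;
    S.nrows = r1 - r0;
    S.ncols = A->ncols;
    int nnz = A->rowptr[r1] - A->rowptr[r0];

    S.rowptr = malloc((S.nrows + 1) * sizeof *S.rowptr);
    S.colidx = malloc(nnz * sizeof *S.colidx);
    S.val    = malloc(nnz * sizeof *S.val);

    for (int r = 0; r <= S.nrows; ++r)   /* rebase row pointers to the slice */
        S.rowptr[r] = A->rowptr[r0 + r] - A->rowptr[r0];
    memcpy(S.colidx, A->colidx + A->rowptr[r0], nnz * sizeof *S.colidx);
    memcpy(S.val,    A->val    + A->rowptr[r0], nnz * sizeof *S.val);
    return S;
}

/* A_upper = rows [0, n/2), A_lower = rows [n/2, n) */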
Here are my environment variables and how I run the program:
export OMP_PLACES=sockets
export OMP_PROC_BIND=spread,close
export OMP_NESTED=True
export OMP_MAX_ACTIVE_LEVELS=2
# run the program
numactl --localalloc ./partitioned_spgemm
After looking into some of the material suggested in the comments, I can now get the thread affinity set up correctly, but the performance is still much worse than I would expect.
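For what it's worth, this is the minimal probe I used to confirm where each nested thread lands (a sketch, assuming Linux's sched_getcpu() is available; the print order is of course arbitrary):

#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);
    omp_set_dynamic(0);

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(24)
        {
            int inner = omp_get_thread_num();
            /* report the CPU and OpenMP place each nested thread runs on */
            printf("outer %d / inner %2d -> cpu %3d (place %d)\n",
                   outer, inner, sched_getcpu(), omp_get_place_num());
        }
    }
    return 0;
}

With the environment variables above, the output shows the two outer threads spread across the two sockets and each inner team staying on its own socket, which is what I intended.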
For reference, computing C_upper = A_upper * B or C_lower = A_lower * B alone on a single socket yields about 700 MFLOPS (million floating-point operations per second). The original SpGEMM C = A * B on two sockets yields about 900 MFLOPS (as you might expect, well below 2 * 700 MFLOPS because of remote NUMA accesses). With proper thread affinity I was expecting the partitioned SpGEMM to approach 1400 MFLOPS, but I only get about 400 MFLOPS with my current setup.
I'm using GCC 8.2.0 with OpenMP 4.5 on Red Hat Enterprise Linux Server 7.6. The SpGEMM kernel is an outer-product-based SpGEMM that uses OpenMP to parallelize its outer loops.
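Roughly, the kernel has the following shape (a heavily simplified sketch, not my actual code: it assumes A stored in CSC and B in CSR, and it accumulates the outer products into per-thread dense buffers before merging, whereas the real kernel uses sparse accumulators; csc_t, csr_t and spgemm_outer are placeholder names):

#include <stdlib.h>
#include <omp.h>

/* hypothetical compressed formats -- the real code differs */
typedef struct { int n, *colptr, *rowidx; double *val; } csc_t;  /* A by columns */
typedef struct { int n, *rowptr, *colidx; double *val; } csr_t;  /* B by rows    */

/* C = A * B via outer products, returned as a dense n x n array
 * (dense only to keep the sketch short; the real kernel keeps C sparse). */
double *spgemm_outer(const csc_t *A, const csr_t *B, int n, int nthreads)
{
    double *C = calloc((size_t)n * n, sizeof *C);

    #pragma omp parallel num_threads(nthreads)
    {
        /* per-thread partial result, merged at the end to avoid atomics */
        double *Cloc = calloc((size_t)n * n, sizeof *Cloc);

        /* outer loop over k: rank-1 update  C += A(:,k) * B(k,:) */
        #pragma omp for schedule(dynamic, 64)
        for (int k = 0; k < n; ++k)
            for (int ia = A->colptr[k]; ia < A->colptr[k + 1]; ++ia) {
                int    i = A->rowidx[ia];
                double a = A->val[ia];
                for (int ib = B->rowptr[k]; ib < B->rowptr[k + 1]; ++ib)
                    Cloc[(size_t)i * n + B->colidx[ib]] += a * B->val[ib];
            }

        /* merge the per-thread partial results into C */
        #pragma omp critical
        for (size_t p = 0; p < (size_t)n * n; ++p)
            C[p] += Cloc[p];

        free(Cloc);
    }
    return C;
}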
I think it must have something to do with the nested threads, but I cannot figure it out myself. How can I solve this?