
I'm measuring the time performance of several multi-threaded schemes that have BLAS functions nested inside them. More specifically, the following calls:

cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, phr, phr, LDA, alpha, A, LDA, B, LDB, beta, C, LDC);
cblas_daxpy(N, alpha, X, incX, Y, incY);

The problem consists of computing the local contribution of each element and then assembling those contributions into a global matrix. Hence, each dgemm call consists of a small number of operations. Whenever the dgemm or daxpy calls are parallelized, the simulation takes longer to execute, and therefore these functions should be executed serially. Note that for very small operations BLAS does not parallelize dgemm/daxpy calls; the matrices here are big enough to be parallelized by default by the BLAS calls, but not big enough to justify the use of additional threads.

A multi-threaded procedure is used to compute each element contribution (which calls those BLAS functions) and to assemble the local matrices into a global one. Three schemes are evaluated for the best time performance, each of which is described next.

OMP scheme

The OMP scheme follows below. The function ComputingCalcstiffAndAssembling is responsible for computing the local contribution (where BLAS is called) and assembling it into the global matrix. The use of either a coloring strategy or atomic_add functions ensures the operation remains thread-safe. A static schedule does not fit this application, hence the use of a dynamic one, but the size of the dynamic block was not evaluated and may not be optimal.

omp_set_num_threads(nthread);
#pragma omp parallel for schedule(dynamic,1)
for (int64_t iel = 0; iel < nelem; iel++) {
    TPZCompEl *el = fMesh->Element(iel);
    if (!el) continue;

    ComputingCalcstiffAndAssembling(stiffness, rhs, el);
}
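
For reference, here is a minimal sketch of the atomic-add assembly path mentioned above, assuming a hypothetical dense, row-major global matrix stored in a flat array. This is not the actual ComputingCalcstiffAndAssembling implementation, only an illustration of why the loop body stays thread-safe without a coloring pass:

#include <cstdint>

// Hypothetical scatter-add of one element matrix into the global matrix.
// globalK: dense global stiffness matrix (row-major), ndof x ndof
// localK : element matrix, ndofel x ndofel
// destIdx: global equation number of each local degree of freedom
void AtomicAssemble(double *globalK, int64_t ndof,
                    const double *localK, int64_t ndofel,
                    const int64_t *destIdx)
{
    for (int64_t i = 0; i < ndofel; i++) {
        for (int64_t j = 0; j < ndofel; j++) {
            const double val = localK[i * ndofel + j];
            // The atomic update keeps concurrent element contributions
            // thread-safe without coloring (the alternative strategy).
            #pragma omp atomic
            globalK[destIdx[i] * ndof + destIdx[j]] += val;
        }
    }
}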

TBB scheme

The TBB scheme follows below. The body of the loop is similar to the one described in the OMP scheme. The library has support for an atomicAdd function under the TBB paradigm, hence thread safety is ensured either by coloring or via such calls.

tbb::global_control global_limit(tbb::global_control::max_allowed_parallelism, nthread);
tbb::parallel_for(tbb::blocked_range<int64_t>(0, nelem),
                  [&](const tbb::blocked_range<int64_t> &r) {
    for (int64_t iel = r.begin(); iel < r.end(); iel++) {
        TPZCompEl *el = fMesh->Element(iel);
        if (!el) continue;

        ComputingCalcstiffAndAssembling(stiffness, rhs, el);
    }
});
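
As a reference for the atomicAdd mechanism mentioned above, a minimal sketch of such a helper in plain C++ follows. This is an assumption about how it could look, not the library's actual implementation:

#include <atomic>

// Hypothetical lock-free accumulation into a double, callable from TBB tasks.
inline void AtomicAdd(std::atomic<double> &target, double value)
{
    double expected = target.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads `expected` on failure, so we simply retry
    // until no other thread modified `target` between the load and the store.
    while (!target.compare_exchange_weak(expected, expected + value)) {
    }
}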

std::thread scheme

This scheme is based on the producer-consumer algorithm. While ThreadWork computes the contributions of the local elements using multiple threads, an additional thread is reserved for ThreadAssembly, which assembles each contribution into the global matrix. The use of mutexes and semaphores ensures the operation remains thread-safe.

std::vector<std::thread> allthreads;
int itr;
for (itr = 0; itr < numthreads; itr++) {
  allthreads.push_back(std::thread(ThreadData::ThreadWork, &threaddata));
}

ThreadData::ThreadAssembly(&threaddata);

for (itr = 0; itr < numthreads; itr++) {
  allthreads[itr].join();
}
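
For context, this is a minimal, self-contained sketch of the producer-consumer pattern this scheme follows, using a standard mutex and condition variable and hypothetical names. It is not the actual ThreadData::ThreadWork / ThreadAssembly implementation, which relies on the library's own mutexes and semaphores:

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>

// Stand-in for a local element contribution (matrix block + destination indices).
struct Contribution { int64_t element; };

std::queue<Contribution> ready;   // contributions waiting to be assembled
std::mutex mtx;
std::condition_variable cv;
int64_t remaining;                // initialized to nelem before the workers start

// Producer: each worker computes local contributions (this is where the BLAS
// calls happen) and hands them to the assembly thread through the queue.
void Work(int64_t first, int64_t last) {
    for (int64_t iel = first; iel < last; iel++) {
        Contribution c{iel};                  // ComputeLocalContribution(iel)
        std::lock_guard<std::mutex> lock(mtx);
        ready.push(c);
        --remaining;
        cv.notify_one();
    }
}

// Consumer: a single thread pops contributions and assembles the global matrix
// serially, so the global matrix itself needs no locking.
void Assembly() {
    std::unique_lock<std::mutex> lock(mtx);
    while (remaining > 0 || !ready.empty()) {
        cv.wait(lock, [] { return !ready.empty() || remaining == 0; });
        while (!ready.empty()) {
            Contribution c = ready.front();
            ready.pop();
            // AssembleIntoGlobal(c);         // runs on this thread only
        }
    }
}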

Controlling MKL #Threads

In order to prevent the BLAS functions from executing with multiple threads, the following calls were tested to control the number of MKL threads:

mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);

This function is supposed to limit the number of threads used by BLAS calls. Also,

mkl_set_num_threads_local(1);

is called before the parallel schemes. This function is supposed to limit the number of threads of all MKL execution, and it is supposed to take precedence over the mkl_domain_set_num_threads call, but that does not always happen. The function mkl_set_num_threads had lower precedence than mkl_set_num_threads_local in the tests, and it is not taken into account here.
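
For clarity, here is a small sketch of how the two control calls are placed in our tests. LimitBlasThreads is a hypothetical helper name; it is called once on the thread that launches the parallel assembly, right before the OMP/TBB/std::thread loop:

#include <mkl.h>

// Hypothetical helper: applies the MKL thread-control variant used in a run.
// "domain == true" corresponds to the "Domain" rows of the tables below,
// "domain == false" to the "Local" rows.
void LimitBlasThreads(bool domain)
{
    if (domain) {
        mkl_domain_set_num_threads(1, MKL_DOMAIN_BLAS);  // BLAS domain only
    } else {
        mkl_set_num_threads_local(1);  // all MKL domains (thread-local setting)
    }
}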

MKL_THREAD_MODEL = OMP

MKL has support for both the OMP and the TBB threading models. So far, executing the BLAS functions on a single thread was possible with MKL_THREAD_MODEL = OMP for all parallel schemes. Then a new test was proposed: controlling the number of threads of both BLAS and the parallel scheme simultaneously (a sketch follows below). It was possible to control the BLAS #threads for the OMP scheme but not for the TBB scheme.
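
For illustration, the combined test for the OMP scheme looks roughly as follows (a sketch reusing the loop from the OMP scheme above; the loop body is elided):

#include <cstdint>
#include <mkl.h>
#include <omp.h>

// Sketch: the assembly loop runs on `nthread` OpenMP threads while every
// nested cblas call is restricted to a single thread.
void AssembleOmpWithSerialBlas(int nthread, int64_t nelem)
{
    mkl_set_num_threads_local(1);   // nested cblas_dgemm/cblas_daxpy stay serial
    omp_set_num_threads(nthread);   // parallelism of the assembly loop itself

    #pragma omp parallel for schedule(dynamic, 1)
    for (int64_t iel = 0; iel < nelem; iel++) {
        // ... same loop body as in the OMP scheme above ...
    }
}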

Is there a way of controlling the #threads for cblas calls nested inside a TBB multi-threaded loop?

MKL_THREAD_MODEL = TBB

So far, we could not restrict cblas nested calls to serial execution when MKL is configured with TBB threads.

Is there a way to restrict cblas nested calls from using multiple-threads when MKL_THREAD_MODEL = TBB?

If so, and towards having more control over the cblas functions' #threads:

Is there a way to set the #threads for nested cblas calls when MKL_THREAD_MODEL = TBB?

Evaluating processor usage for each set-up

The CPU usage and the simulation time are measured for each set-up and displayed in the following table for MKL_THREAD_MODEL=OMP,

MKL_THREAD_MODEL=OMP

Assemble paradigm   MKL_control   #Threads   %CPU   Duration (s)   Comments
OMP                 Local         2           200   35.7           Expected
OMP                 Domain        2           200   35.5           Expected
TBB                 Local         2          1311   62.9           Not expected
TBB                 Domain        2           202   35.6           Expected
Serial              Local         1           100   70.0           Expected
Serial              Domain        1           100   69.1           Expected
std::thread         Local         2          2415   80.0           Not expected
std::thread         Domain        2           208   39.1           Expected

These simulations ran as expected, with two exceptions. The TBB and std::thread schemes are unable to restrict the cblas functions to serial execution when the number of threads is set via mkl_set_num_threads_local(1). This finding goes against Intel's suggestion to give this call preference over mkl_domain_set_num_threads, stated in

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/techniques-to-set-the-number-of-threads.html.

Why does mkl_set_num_threads_local(1) not take precedence over mkl_domain_set_num_threads?

Moreover, in the TBB scheme the %CPU is under 1600%, indicating that the execution runs on a single processor (see the technical data in the next section), while for the std::thread scheme the %CPU is over 1600%, indicating that multiple processors are working concurrently.

The hyperthreading option is enabled in the BIOS, but we cannot be sure it is actually in use during execution. Is there a way to check whether hyperthreading is being employed during a particular execution?

The same measurements are made for MKL_THREAD_MODEL=TBB, and the results are shown in the table below,

MKL_THREAD_MODEL=TBB

Assemble paradigm   MKL_control   #Threads   %CPU   Duration (s)   Comments
OMP                 Local         2          2550   101.5          Not expected
OMP                 Domain        2          2865   124.6          Not expected
TBB                 Local         2           202   39.4           Not expected
TBB                 Domain        2           201   47.3           Not expected
Serial              Local         1           100   69.6           Expected
Serial              Domain        1          2526   247.9          Not expected
std::thread         Local         2          2995   124.9          Not expected
std::thread         Domain        2          2946   124.1          Not expected

It was not possible to limit the CBLAS execution to a single thread for most of the schemes. Even for the TBB scheme, where we managed to limit the number of parallel threads, the simulation time is not optimal and varies from one execution to the next. It seems that TBB is employing the right number of threads, but whether they are employed on the parallel scheme or on the CBLAS execution is not clear.

Technical information

The experiments are run on a 32-processor machine with the Ubuntu 18.04.3 LTS operating system. Each processor has the following technical data, obtained via the command cat /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x2000064
cpu MHz         : 1000.431
cache size      : 22528 KB
physical id     : 0
siblings        : 32
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single 

Hyper-threading is active, as can be seen in the "Thread(s) per core" entry of the lscpu output,

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             1000.709
CPU max MHz:         3700,0000
CPU min MHz:         1000,0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            22528K

The %CPU reported in the previous section is the maximum %CPU observed via the command top -d 1.

Is there a more appropriate tool than top to check %CPU, more specifically, one capable of telling whether hyperthreading is in use or whether multiple processors are working at the same time?
