Here is code that uses the MKL vector-addition function vdAdd:
#include "mkl.h"
#include <chrono>
#include <iostream>

int main() {
    const int n = 10000000;
    const int nbRuns = 1000;
    // 64-byte aligned buffers for MKL
    double *a = (double *) mkl_malloc(n * sizeof(double), 64);
    double *b = (double *) mkl_malloc(n * sizeof(double), 64);
    double *c = (double *) mkl_malloc(n * sizeof(double), 64);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }
    // Warm-up run, not timed
    vdAdd(n, a, b, c); // MKL call
    auto start = std::chrono::steady_clock::now(); // monotonic clock for timing
    for (int i = 0; i < nbRuns; i++) {
        vdAdd(n, a, b, c); // MKL call
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time: " << elapsed.count() << " sec." << std::endl;
    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    return 0;
}
Testing this on one machine gives the expected speedup as the number of threads increases, but on another machine there is no improvement at all, even though MKL is using more than one thread (visible in the system monitor). I am compiling against mkl_rt on Linux with g++ -std=c++14. Am I missing something?
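For reference, the build command looks roughly like this (the paths and the MKLROOT variable are assumptions; they depend on the local MKL installation):

```shell
# Hypothetical paths -- adjust MKLROOT to the local MKL installation
g++ -std=c++14 -O3 -I"${MKLROOT}/include" testmkl.cpp \
    -L"${MKLROOT}/lib/intel64" -lmkl_rt -o testmkl
```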
Update: it turns out that plain OpenMP exhibits the same behavior, as shown by the following code (which does not use MKL):
#include <omp.h>
#include <chrono>
#include <iostream>

int main() {
    const int n = 10000000;
    const int nbRuns = 1000;
    double *a = new double[n];
    double *b = new double[n];
    double *c = new double[n];
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }
    auto start = std::chrono::steady_clock::now(); // monotonic clock for timing
    for (int i = 0; i < nbRuns; i++) {
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            c[j] = a[j] + b[j];
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time: " << elapsed.count() << " sec." << std::endl;
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
Here is some technical info:
Machine 1 (no speedup): an HP laptop
CPU:
model name : Intel(R) Core(TM) i7 CPU Q 840 @ 1.87GHz
cpu MHz : 1199.000
cache size : 8192 KB
Memory:
Handle 0x0004, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 16 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x0005, DMI type 17, 27 bytes
Memory Device
Total Width: 64 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: SODIMM
Locator: Top-Slot 1(top)
Bank Locator: BANK 0
Type: DDR3
Type Detail: Synchronous
Speed: 1333 MHz
Machine 2 (normal speedup): a single node of a large HPC cluster
CPU:
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
cpu MHz : 2001.000
cache size : 20480 KB
Memory:
The node has 128 GiB, but since I do not have root privileges, I cannot gather more info.
Compiler: gcc 6.1 (same effect with 5.3)
g++ -std=c++14 -O3 -fopenmp -o testomp testomp.cpp
The number of threads is controlled by OMP_NUM_THREADS.
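For completeness, the runs were driven like this (the testmkl binary name is hypothetical; testomp is the one from the compile line above):

```shell
# Run the OpenMP test with increasing thread counts
OMP_NUM_THREADS=1 ./testomp
OMP_NUM_THREADS=2 ./testomp
OMP_NUM_THREADS=4 ./testomp
# For the MKL build, MKL_NUM_THREADS can also be set; it takes
# precedence over OMP_NUM_THREADS for MKL's internal threading.
MKL_NUM_THREADS=4 ./testmkl
```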