Here is code that uses the MKL vector-addition function vdAdd:
#include "mkl.h"
#include <chrono>
#include <iostream>

int main() {
    const int n = 10000000;
    const int nbRuns = 1000;
    // 64-byte aligned buffers for MKL
    double *a = (double *) mkl_malloc(n * sizeof(double), 64);
    double *b = (double *) mkl_malloc(n * sizeof(double), 64);
    double *c = (double *) mkl_malloc(n * sizeof(double), 64);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }
    // Warm-up run, not timed
    vdAdd(n, a, b, c); // MKL call
    auto start = std::chrono::steady_clock::now(); // monotonic clock for timing
    for (int i = 0; i < nbRuns; i++) {
        vdAdd(n, a, b, c); // MKL call
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time: " << elapsed.count() << " sec." << std::endl;
    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    return 0;
}
Testing this on one machine gives the expected speedup as the number of threads increases, but on another machine there is no improvement at all, even though MKL is using more than one thread (visible in the system monitor). I am compiling against mkl_rt on Linux with g++ -std=c++14. Am I missing something?
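For reference, the build command looks roughly like this (the paths and the MKLROOT variable are assumptions; they depend on the local MKL installation):

```shell
# Hypothetical paths -- adjust MKLROOT to the local MKL installation
g++ -std=c++14 -O3 -I"${MKLROOT}/include" testmkl.cpp \
    -L"${MKLROOT}/lib/intel64" -lmkl_rt -o testmkl
```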
Update: it turns out that plain OpenMP exhibits the same behavior, as shown by the following code (which does not use MKL):
#include <omp.h>
#include <chrono>
#include <iostream>

int main() {
    const int n = 10000000;
    const int nbRuns = 1000;
    double *a = new double[n];
    double *b = new double[n];
    double *c = new double[n];
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }
    auto start = std::chrono::steady_clock::now(); // monotonic clock for timing
    for (int i = 0; i < nbRuns; i++) {
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            c[j] = a[j] + b[j];
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time: " << elapsed.count() << " sec." << std::endl;
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
Here is some technical info:
Machine 1 (no speedup): an HP laptop
CPU:
model name : Intel(R) Core(TM) i7 CPU Q 840 @ 1.87GHz
cpu MHz : 1199.000
cache size : 8192 KB
Memory:
Handle 0x0004, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 16 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x0005, DMI type 17, 27 bytes
Memory Device
Total Width: 64 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: SODIMM
Locator: Top-Slot 1(top)
Bank Locator: BANK 0
Type: DDR3
Type Detail: Synchronous
Speed: 1333 MHz
Machine 2 (normal speedup): a single node of a large HPC cluster
CPU:
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
cpu MHz : 2001.000
cache size : 20480 KB
Memory:
The node has 128 GiB, but since I do not have root privileges, I cannot gather more info.
Compiler: gcc 6.1 (same effect with 5.3)
g++ -std=c++14 -O3 -fopenmp -o testomp testomp.cpp
The number of threads is controlled by OMP_NUM_THREADS.
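For completeness, the runs were driven like this (the testmkl binary name is hypothetical; testomp is the one from the compile line above):

```shell
# Run the OpenMP test with increasing thread counts
OMP_NUM_THREADS=1 ./testomp
OMP_NUM_THREADS=2 ./testomp
OMP_NUM_THREADS=4 ./testomp
# For the MKL build, MKL_NUM_THREADS can also be set; it takes
# precedence over OMP_NUM_THREADS for MKL's internal threading.
MKL_NUM_THREADS=4 ./testmkl
```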