Multithreading slows program: no False-sharing, no mutex, no cache misses, no small workload

Question

Multi-threading slows down my code, even though I've paid attention to these posts:

Multi-threaded GEMM slower than single threaded one?

Why is this OpenMP program slower than single-thread?

I think all the precautions were taken care of:

My CPU has 4 cores plus hyperthreading (8 logical cores), and I don't run more than 4 threads.
The number of vector entries each thread works on seems large enough (2 million per thread), so any false sharing (the cache-line problem) should be negligible, because most data doesn't overlap with the data of other threads.
Entries are consecutive in memory, so the chance of a cache miss is very small.
I use a tmp variable for consecutive operations instead of assigning values directly into the array.
I build in release mode in Visual Studio.
There are no critical sections between threads (they don't use mutexes and don't share data).

When measuring time, I am including thread creation. Surely launching 4 threads can't be that expensive?

1 thread: around 140 milliseconds

4 threads: around 155 milliseconds

Main:

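#include <chrono>
#include <cstdlib>   // system
#include <iostream>
#include <thread>
#include <vector>

struct MyStruct {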
   double val = 0;
};


size_t numEntries = 100e4;
size_t numThreads = 4;
std::vector<MyStruct> arr;


void launchThreads(float &avgTime); // defined below

int main(){
    arr.reserve(numEntries);
    for(size_t i=0; i<numEntries; ++i){
        MyStruct m{ static_cast<double>(i) }; // explicit cast: brace-init forbids narrowing size_t to double
        arr.push_back(m);
    }

    //run several times 
    float avgTime=0;
    for(size_t n=0; n<100; ++n){
        launchThreads(avgTime);
        //space out to make avgTime more even:
        std::this_thread::sleep_for(std::chrono::milliseconds(10));

    }

    avgTime /= 100;

    std::cout << "finished in " << avgTime <<"milliseconds\n";
    system("pause");
}

Launching and running the threads:

//run by each thread
void threadWork(size_t threadId){
    size_t numPerThread = (numEntries+numThreads -1) / numThreads;

    size_t start_ix = threadId * numPerThread;

    size_t endIx;
    if (threadId == numThreads - 1) {
        endIx = numEntries-1;//we are the last thread
    }
    else {
        endIx = start_ix + numPerThread;
    }

    for(size_t i=5; i<endIx-5; ++i){
        double tmp = arr[i].val; 

        tmp += arr[i-1].val;
        tmp += arr[i-3].val;
        tmp += arr[i-4].val;
        tmp += arr[i-5].val;
        tmp += arr[i-2].val;

        tmp += arr[i+1].val;
        tmp += arr[i+3].val;
        tmp += arr[i+4].val;
        tmp += arr[i+5].val;
        tmp += arr[i+2].val;

        if(tmp > 0){ tmp *= 0.5f;}
        else{ tmp *= 0.3f; }

        arr[i].val = tmp;
    }
}//end()


//measures time
void launchThreads(float &avgTime){

    using namespace std::chrono;
    typedef std::chrono::milliseconds ms;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    std::vector<std::thread> threads;
    for (size_t i = 0; i < numThreads; ++i) {
        std::thread t = std::thread(threadWork, i);
        threads.push_back(std::move(t));
    }

    for (size_t i = 0; i < numThreads; ++i) {
        threads[i].join();
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    ms timespan = duration_cast<ms>(t2 - t1);
    avgTime += timespan.count();
}

The call to `this_thread::sleep_for` looks suspicious to me. Also see [Multi-threading benchmark](https://stackoverflow.com/q/41388602/608639), [How to benchmark Linux threaded programs?](https://stackoverflow.com/q/11034342/608639), [Poor performance in multi-threaded C++ program](https://stackoverflow.com/q/15177726/608639), etc. — jww, Sep 23 '18 at 03:17
Thanks, will check the links! Because running the code produces different results each time, I just wanted to average out the duration of the trials. I added `sleep_for` in the main thread to spread out the computation (in case my PC was doing something else at that moment) – Kari, Sep 23 '18 at 03:53
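
As a side note on the averaging discussed in these comments, here is a minimal sketch of one way to time repeated trials. `meanMillis` is a hypothetical helper that does not appear in the original post. It assumes `std::chrono::steady_clock` (which is guaranteed to be monotonic) and accumulates fractional milliseconds, whereas the `duration_cast<ms>` in the question truncates every run to whole milliseconds before adding it up.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the original post): calls `work` `runs` times
// and returns the mean wall-clock time per call in fractional milliseconds.
template <typename Work>
double meanMillis(Work work, std::size_t runs) {
    using clock = std::chrono::steady_clock;               // monotonic clock
    std::chrono::duration<double, std::milli> total{0.0};  // keeps sub-ms precision

    for (std::size_t r = 0; r < runs; ++r) {
        const auto t1 = clock::now();
        work();
        const auto t2 = clock::now();
        total += t2 - t1;  // converts implicitly to floating-point milliseconds
    }
    return runs == 0 ? 0.0 : total.count() / static_cast<double>(runs);
}
```

Any callable that takes no arguments can be passed, for example a lambda that creates and joins the worker threads, so that thread start-up cost is included in the measurement as in the question.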

score 2 · Accepted Answer · answered Sep 23 '18 at 08:14

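The following is your problem. Every thread starts its loop at index 5 instead of at its own start_ix, so all threads work on overlapping ranges starting from the beginning of the array: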

for(size_t i=5; i<endIx-5; ++i){
           ^^^

It should be:

for(size_t i=start_ix + 5; i<endIx-5; ++i){
           ^^^^^^^^^^^^^^

crayzeewulf

you just made me facepalm. So all of them were using shared region. – Kari Sep 23 '18 at 13:18
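
To illustrate the fix from the accepted answer, here is a minimal sketch of partitioning the index space so that each thread owns a disjoint range. The names `chunkRange` and `smoothParallel` are hypothetical and not part of the original post. The sketch also makes one design change that the answer does not: it reads from an input vector and writes to a separate output vector, so no thread reads an element that a neighbouring thread may be writing at a chunk boundary. Because of that, its numeric results differ from the in-place loop in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: half-open index range [first, second) owned by one thread.
std::pair<std::size_t, std::size_t>
chunkRange(std::size_t threadId, std::size_t numThreads, std::size_t numEntries) {
    const std::size_t perThread = (numEntries + numThreads - 1) / numThreads; // ceiling division
    const std::size_t first = std::min(threadId * perThread, numEntries);
    const std::size_t last  = std::min(first + perThread, numEntries);
    return {first, last};
}

// Applies the 11-point sum from the question, each thread on its own chunk.
// Assumption: separate `in` and `out` buffers instead of updating in place.
void smoothParallel(const std::vector<double>& in, std::vector<double>& out,
                    std::size_t numThreads) {
    const std::size_t n = in.size();
    const std::size_t halo = 5;      // neighbours read on each side
    out = in;                        // border entries are copied unchanged
    if (n <= 2 * halo) return;       // nothing to process

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&in, &out, t, numThreads, n, halo] {
            const std::pair<std::size_t, std::size_t> r = chunkRange(t, numThreads, n);
            // Clamp so that i - halo and i + halo stay inside the vector.
            const std::size_t first = std::max(r.first, halo);
            const std::size_t last  = std::min(r.second, n - halo);
            for (std::size_t i = first; i < last; ++i) {   // starts at this thread's own offset
                double tmp = 0.0;
                for (std::size_t k = i - halo; k <= i + halo; ++k) tmp += in[k];
                tmp *= (tmp > 0) ? 0.5 : 0.3;
                out[i] = tmp;
            }
        });
    }
    for (std::thread& th : threads) th.join();
}
```

With 4 threads and 100 entries, the chunks are [0, 25), [25, 50), [50, 75) and [75, 100), and only the first and last chunks are trimmed by the 5-element border, so each thread does roughly a quarter of the work instead of all of it.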
