
The code below measures the performance of launching an empty kernel (to verify the kernel dispatch rate) from multiple threads using std::async.

#include <stdio.h>
#include <stddef.h>
#include <chrono>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <algorithm>
#include <array>
#include <atomic>
#include <cmath>
#include <string>
#include <thread>
#include <future>
#include <functional>
#include <vector>


#define WARMUP_RUN_COUNT 10
#define TIMING_RUN_COUNT 100
#define TOTAL_RUN_COUNT (WARMUP_RUN_COUNT + TIMING_RUN_COUNT)

__global__ void EmptyKernel() {}

void print_timing(std::string test, std::array<float, TOTAL_RUN_COUNT> &results, int batch = 1)
{

    float total_us = 0.0f, mean_us = 0.0f, stddev_us = 0.0f;

    // discard the WARMUP_RUN_COUNT largest samples as outliers, given the variability across many multi-threaded runs
    std::sort(results.begin(), results.end(), std::greater<float>());
    auto start_iter = std::next(results.begin(), WARMUP_RUN_COUNT);
    auto end_iter = results.end();

    // mean
    std::for_each(start_iter, end_iter, [&](const float &run_ms) {
        total_us += (run_ms * 1000) / batch;
    });
    mean_us = total_us / TIMING_RUN_COUNT;

    // stddev
    total_us = 0;
    std::for_each(start_iter, end_iter, [&](const float &run_ms) {
        float dev_us = ((run_ms * 1000) / batch) - mean_us;
        total_us += dev_us * dev_us;
    });
    stddev_us = sqrt(total_us / TIMING_RUN_COUNT);

    printf("\n %s: %.1f us, std: %.1f us\n", test.c_str(), mean_us, stddev_us);
}

void kernel_enqueue_rate(std::atomic_int* shared, int max_threads)
{
    //resources necessary for this thread
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    std::array<float, TOTAL_RUN_COUNT> results;

    //synchronize all threads, before running
    int tid = shared->fetch_add(1, std::memory_order_release);
    while (max_threads != shared->load(std::memory_order_acquire)) {}

    for (auto i = 0; i < TOTAL_RUN_COUNT; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        EmptyKernel<<<1, 1, 0, stream>>>();
        auto stop = std::chrono::high_resolution_clock::now();
        results[i] = std::chrono::duration<double, std::milli>(stop - start).count();
    }
    print_timing("Thread ID : " + std::to_string(tid) + " , " + "Kernel enqueue rate", results);
}


// Simple thread pool
struct thread_pool {
    thread_pool(int total_threads) : max_threads(total_threads) {}
    void start(std::function<void(std::atomic_int*, int)> f) {
        for (int i = 0; i < max_threads; ++i) {
            threads.push_back(std::async(std::launch::async, f, &shared, max_threads));
        }
    }
    void finish() {
        for (auto&&thread : threads) {
            thread.get();
        }
        threads.clear();
        shared = {0};
    }
    ~thread_pool() {
        finish();
    }
private:
    std::atomic_int shared {0};
    std::vector<std::future<void>> threads;
    int max_threads = 1;
};

int main(int argc, char* argv[])
{
    int max_threads = 4;

    thread_pool task(max_threads);

    task.start(kernel_enqueue_rate);
    task.finish();

}

The observation is that some threads take much more time than the others; for example, in the run below, two threads take approximately 6 us while the other two take close to or more than 10 us.

 Thread ID : 0 , Kernel enqueue rate: 9.5 us, std: 9.3 us

 Thread ID : 2 , Kernel enqueue rate: 5.7 us, std: 2.9 us

 Thread ID : 1 , Kernel enqueue rate: 11.7 us, std: 7.3 us

 Thread ID : 3 , Kernel enqueue rate: 6.0 us, std: 2.1 us

What is the reason for this significant difference, and is there a way to avoid it and get similar results on all the threads?

Satyanvesh D
    Kernel launch is asynchronous, so you may want to synchronize before setting `stop` – Ander Biguri Feb 25 '20 at 12:13
  • @AnderBiguri This makes sense. Since timings are measured on the CPU and kernel launches are async, we need to synchronize so that we get the exact rate on all the threads. Is that right? – Satyanvesh D Feb 25 '20 at 13:28
  • Yes. Basically, once your kernel is launched, the CPU will run the next line regardless of whether the kernel has finished or not. Add `cudaDeviceSynchronize()` to force it to wait for the kernel to return. – Ander Biguri Feb 25 '20 at 13:40
  • Apologies, you will want to synchronize the streams, not the device. – Ander Biguri Feb 25 '20 at 13:57 (a synchronized-timing sketch follows these comments)
  • But here we have only 1 stream, so whether we do a device sync or a stream sync, shouldn't it be the same? – Satyanvesh D Feb 25 '20 at 17:37
  • Does pinning the threads to cores help in getting more uniform performance? – Curious Feb 25 '20 at 18:02
  • @Curious Tried running the program as `taskset -c 5,6 ./a.out` to restrict the process to cores 5 and 6, and most of the time the performance is uniform (with or without stream synchronize). Is this the right way to pin the threads, or do we need to change the implementation itself and just run `./a.out`? And why does this result in uniform performance? – Satyanvesh D Feb 25 '20 at 18:33
  • @SatyanveshD This is one of the ways to pin the threads. Alternatively, you could pin the threads programmatically. A useful reference is provided in [this](https://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/) post by Eli Bendersky. I suspect that the threads reporting longer running times were migrating across cores more than the others. By limiting the 4 threads to 2 cores, you are likely constraining them to a similar amount of resource sharing, resulting in uniform performance. I expect the performance to remain uniform but improve if you use 4 cores. – Curious Feb 25 '20 at 18:52
  • @Curious Thanks. Firstly, if I use 4 cores instead of 2 with taskset, the performance is not as good as with 2; I guess this might be due to thread migration across the 4 cores. Is that correct? Secondly, I have tried pinning the threads programmatically: if I pin each thread to a different CPU, the improvement is small, but if I pin all the threads to the same CPU, I consistently see much better and more uniform performance, so the other CPUs are not used even when available. Is this the right way to do it, or does it depend on the configuration and environment? – Satyanvesh D Feb 26 '20 at 16:22
  • @SatyanveshD How many cores are there on the system? Your second observation might also be related to the number of cores on the system. If you programmatically pin the 4 threads to 2 cores (instead of 4), do you see identical performance to using `taskset` with 2 cores? My hypotheses are: 1) pinning threads to cores improves launch latency by preventing thread migration; 2) pinning beyond a certain number of cores is counterproductive due to interference with the kernel; 3) the performance sweet spot in choosing the number of cores to pin should be identical between `taskset` and programmatic pinning. – Curious Feb 26 '20 at 17:16
  • @Curious There are 8 CPUs on the system (from /proc/cpuinfo). I have yet to verify by programmatically pinning 4 threads to 2 cores. Maybe when I pin to 4 cores with taskset, the threads are still migrating, leading to non-uniform performance. Also, can you please elaborate on your 3rd point? – Satyanvesh D Feb 26 '20 at 18:26
  • @SatyanveshD `cat /proc/cpuinfo | grep "cpu cores" | uniq` should give the number of physical cores. Are there 8 cores including hyperthreading? In my 3rd point, I meant both `taskset` and programmatic pinning should give equally good or bad performance. Another interesting data point is when you pin to three cores and how it compares to pinning to two and four cores. – Curious Feb 27 '20 at 03:58
  • @Curious 1) I can see 4 cores if I grep for cpu cores, but there are 8 processors, 0 to 7, if I just look at /proc/cpuinfo; all 8 CPUs show 4 as "cpu cores". I can still run on 0 to 7 with taskset, so that means 8 cores are available for pinning, right? 2) Also, if I pin programmatically, I see more uniformity if I map to just 1 core, and the uniformity decreases gradually as I increase the affinity to 2, 3, and 4 cores. – Satyanvesh D Feb 27 '20 at 08:31
  • Adding the 3rd point here due to length: 3) I have verified both taskset and programmatic pinning, and the actual enqueue rates are reported slightly higher with taskset for all the threads, though the difference in uniformity between the threads looks similar with taskset and with programmatic pinning. – Satyanvesh D Feb 27 '20 at 08:32
  • @SatyanveshD Thanks for sharing the experimental results. I have posted an answer summarizing the discussion and your findings. If you agree, please accept it. – Curious Feb 27 '20 at 17:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/208664/discussion-between-satyanvesh-d-and-curious). – Satyanvesh D Feb 27 '20 at 18:12
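
Based on the synchronization suggestion in the comments, below is a minimal single-threaded sketch (an assumed simplification, not the original program) that contrasts timing the enqueue alone with timing enqueue plus `cudaStreamSynchronize()`; the first number reflects launch latency only, the second includes kernel completion.

// Sketch: enqueue-only timing vs. enqueue + stream synchronization.
// Build with nvcc; the file name and run count are illustrative.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void EmptyKernel() {}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int runs = 100;

    // Time the enqueue only: the CPU continues as soon as the launch is queued.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) {
        EmptyKernel<<<1, 1, 0, stream>>>();
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    // Drain the stream so pending launches do not bleed into the next measurement.
    cudaStreamSynchronize(stream);

    // Time enqueue plus completion: synchronize the stream before stopping the clock.
    auto t2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) {
        EmptyKernel<<<1, 1, 0, stream>>>();
        cudaStreamSynchronize(stream);
    }
    auto t3 = std::chrono::high_resolution_clock::now();

    printf("enqueue only   : %.1f us per launch\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / runs);
    printf("enqueue + sync : %.1f us per launch\n",
           std::chrono::duration<double, std::micro>(t3 - t2).count() / runs);

    cudaStreamDestroy(stream);
    return 0;
}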

1 Answer


what is the reason for this significant difference

The threads are migrating across cores and contending for the cores alongside other processes. The interference effects are not uniform across threads.

is there a way to avoid this and get similar results on all the threads

By pinning the threads to cores, which can be done with `taskset` or programmatically, as explained in [this post](https://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/) by Eli Bendersky.
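
As an illustration (a sketch, not the answerer's code), the worker threads could be pinned programmatically with `pthread_setaffinity_np()` along the lines of the post above. The thread-to-core mapping below (thread i to core i) is an assumption and should be tuned per system; the question's experiments found pinning everything to a single core gave the most uniform results.

// Sketch: pinning std::thread workers to cores on Linux (compile with -pthread).
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void worker(int tid)
{
    // ... per-thread stream creation and kernel-launch timing would go here ...
    printf("thread %d running on core %d\n", tid, sched_getcpu());
}

int main()
{
    const int num_threads = 4;
    std::vector<std::thread> threads;

    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back(worker, i);

        // Restrict thread i to core i (illustrative mapping; use the same core
        // index for every thread to reproduce the single-core experiment).
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        if (pthread_setaffinity_np(threads.back().native_handle(),
                                   sizeof(cpu_set_t), &cpuset) != 0) {
            fprintf(stderr, "failed to set affinity for thread %d\n", i);
        }
    }

    for (auto &t : threads) {
        t.join();
    }
    return 0;
}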

Curious