
I'm currently learning about pthreads in C and came across the issue of False Sharing. I think I understand the concept of it and I've tried experimenting a bit.

Below is a short program that I've been playing around with. Eventually I'm going to change it into a program to take a large array of ints and sum it in parallel.

#include <stdio.h>
#include <pthread.h>

#define THREADS 4
#define NUMPAD 14

struct s
{
  int total; // 4 bytes
  int my_num; // 4 bytes
  int pad[NUMPAD]; // 4 * NUMPAD bytes
} sum_array[THREADS];

static void *worker(void * ind) {
    const int curr_ind = *(int *) ind;
    for (int i = 0; i < 10; ++i) {
      sum_array[curr_ind].total += sum_array[curr_ind].my_num;
    }
    printf("%d\n", sum_array[curr_ind].total);
    return NULL;
}

int main(void) {
    int args[THREADS] = { 0, 1, 2, 3 };
    pthread_t thread_ids[THREADS];

    for (size_t i = 0; i < THREADS; ++i) {
        sum_array[i].total = 0;
        sum_array[i].my_num = i + 1;
        pthread_create(&thread_ids[i], NULL, worker, &args[i]);
    }

    for (size_t i = 0; i < THREADS; ++i) {
        pthread_join(thread_ids[i], NULL);
    }
}

My question: is it possible to prevent false sharing without using padding? Here struct s has a size of 64 bytes so that each struct is on its own cache line (assuming a 64-byte cache line). I'm not sure how else I can achieve parallelism without padding.

Also, if I were to sum an array whose size varies between 1,000 and 50,000 bytes, how could I prevent false sharing? Would I be able to pad it out using a similar program? My current thought is to put each int from the big array into an array of struct s and then sum it in parallel. However, I'm not sure this is the optimal solution.

Ardembly
  • You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC. – avatli Sep 25 '18 at 11:34
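As a portable alternative to the compiler-specific attributes in the comment above, C11 provides alignas in <stdalign.h>. The sketch below is only an illustration under stated assumptions: the 64-byte line size is assumed, and the names CACHE_LINE, aligned_sum, aligned_sums, and is_line_aligned are made up for this example. Note that a pad array only makes the struct 64 bytes large; it does not by itself force the array to start on a cache-line boundary, which an alignment specifier does.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof (C11) */
#include <stdint.h>    /* uintptr_t */

#define CACHE_LINE 64  /* assumption: 64-byte cache lines; check your target CPU */

/* Aligning the first member raises the alignment of the whole struct,
 * and the compiler then pads sizeof up to a multiple of that alignment. */
struct aligned_sum {
    alignas(CACHE_LINE) int total;
    int my_num;
};

/* Compile-time checks that each element fills exactly one line. */
static_assert(sizeof(struct aligned_sum) == CACHE_LINE,
              "each element should occupy one full cache line");
static_assert(alignof(struct aligned_sum) == CACHE_LINE,
              "each element should start on a cache-line boundary");

/* One slot per thread, as in the question's sum_array. */
struct aligned_sum aligned_sums[4];

/* Hypothetical helper: nonzero if p sits on a cache-line boundary. */
static int is_line_aligned(const void *p) {
    return (uintptr_t)p % CACHE_LINE == 0;
}
```

With this layout the explicit pad member is no longer needed, because the compiler inserts the padding itself.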

1 Answer


Partition the problem: In worker(), sum into a local variable, then add the local variable to the array:

static void *worker(void * ind) {
    const int curr_ind = *(int *) ind;
    int localsum = 0;
    for (int i = 0; i < 10; ++i) {
      localsum += sum_array[curr_ind].my_num;
    }
    sum_array[curr_ind].total += localsum;
    printf("%d\n", sum_array[curr_ind].total);
    return NULL;
}

This may still incur false sharing on the final store after the loop, but only once per thread. Thread creation overhead is far more significant than a single cache miss. Of course, you probably want a loop that actually does something time-consuming, as your current code can be optimized to:

static void *worker(void * ind) {
    const int curr_ind = *(int *) ind;
    int localsum = 10 * sum_array[curr_ind].my_num;
    sum_array[curr_ind].total += localsum;
    printf("%d\n", sum_array[curr_ind].total);
    return NULL;
}

Its runtime is definitely dominated by thread creation and by synchronization in printf().
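Applied to the large-array case from the question, the same idea might look like the sketch below. It assumes one contiguous slice per thread and a local accumulator, so shared memory is written only once per thread. The names NUM_WORKERS, struct chunk, sum_chunk, and parallel_sum are hypothetical and not part of the original code.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4  /* assumed thread count, matching THREADS in the question */

/* Hypothetical job description: one contiguous slice per thread. */
struct chunk {
    const int *data;    /* first element of this thread's slice */
    size_t     len;     /* number of elements in the slice */
    long long  partial; /* written exactly once, after the loop */
};

static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    long long local = 0;  /* private to this thread, so no sharing in the loop */
    for (size_t i = 0; i < c->len; ++i) {
        local += c->data[i];
    }
    c->partial = local;   /* the only store to memory other threads sit next to */
    return NULL;
}

/* Sums data[0..n) with NUM_WORKERS threads.
 * Returns 0 on success, -1 if a thread could not be created. */
int parallel_sum(const int *data, size_t n, long long *out) {
    pthread_t tids[NUM_WORKERS];
    struct chunk chunks[NUM_WORKERS];
    size_t base = n / NUM_WORKERS, extra = n % NUM_WORKERS, start = 0;
    int started = 0, rc = 0;

    for (int t = 0; t < NUM_WORKERS; ++t) {
        /* The first `extra` threads take one additional element. */
        size_t len = base + ((size_t)t < extra ? 1 : 0);
        chunks[t] = (struct chunk){ data + start, len, 0 };
        start += len;
        if (pthread_create(&tids[t], NULL, sum_chunk, &chunks[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }

    long long total = 0;
    for (int t = 0; t < started; ++t) {
        pthread_join(tids[t], NULL);  /* join makes partial visible here */
        total += chunks[t].partial;
    }
    if (rc == 0) {
        *out = total;
    }
    return rc;
}
```

The input array is only read, so it needs no padding at all; read-only data can be shared between cores without invalidating cache lines. Copying each int into its own padded struct would multiply memory traffic by 16 without any benefit.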

EOF