Why this simple program on shared variable does not scale? (no lock)

Question

I'm new to concurrent programming. I implement a CPU intensive work and measure how much speedup I could gain. However, I cannot get any speedup as I increase #threads.

The program does the following task:

There's a shared counter to count from 1 to 1000001.
Each thread does the following until the counter reaches 1000001:
- increments the counter atomically, then
- run a loop for 10000 times.

There're 1000001*10000 = 10^10 operations in total to be perform, so I should be able to get good speedup as I increment #threads.

Here's how I implemented it:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdatomic.h>

pthread_t workers[8];
atomic_int counter; // a shared counter

void *runner(void *param);

int main(int argc, char *argv[]) {
  if(argc != 2) {
    printf("Usage: ./thread thread_num\n");
    return 1;
  }

  int NUM_THREADS = atoi(argv[1]);
  pthread_attr_t attr;

  counter = 1; // initialize shared counter
  pthread_attr_init(&attr);

  const clock_t begin_time = clock(); // begin timer
  for(int i=0;i<NUM_THREADS;i++)
    pthread_create(&workers[i], &attr, runner, NULL);

  for(int i=0;i<NUM_THREADS;i++)
    pthread_join(workers[i], NULL);

  const clock_t end_time = clock(); // end timer

  printf("Thread number = %d, execution time = %lf s\n", NUM_THREADS, (double)(end_time - begin_time)/CLOCKS_PER_SEC);

  return 0;
}

void *runner(void *param) {
  int temp = 0;

  while(temp < 1000001) {
    temp = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    for(int i=1;i<10000;i++)
      temp%i; // do some CPU intensive work
  }

  pthread_exit(0);
}

However, as I run my program, I cannot get better performance than sequential execution!!

gcc-4.9 -std=c11 -pthread -o my_program my_program.c
for i in 1 2 3 4 5 6 7 8; do \
        ./my_program $i; \
    done
Thread number = 1, execution time = 19.235998 s
Thread number = 2, execution time = 20.575237 s
Thread number = 3, execution time = 25.161116 s
Thread number = 4, execution time = 28.278671 s
Thread number = 5, execution time = 28.185605 s
Thread number = 6, execution time = 28.050380 s
Thread number = 7, execution time = 28.286925 s
Thread number = 8, execution time = 28.227132 s

I run the program on a 4-core machine.

Does anyone have suggestions to improve the program? Or any clue why I cannot get speedup?

Be careful when using "empty" for loops like that for benchmarking. Most modern C compilers will optimize that away by removing it completely from the code (since the program basically functions the same way with or without the loop, only faster (and faster is better)). Make sure the operation in the loop has side-effects that changes the meaning of the code if the compiler tries to remove it. — slebetman, Apr 29 '15 at 02:42
Basically, assign that `temp%i` to a temporary variable or something or print it out. — slebetman, Apr 29 '15 at 02:43
Thanks @slebetman , that's really helpful!! I tried what you suggested and the for-loop was finally executed. Although, I still get similar results, I'll pay more attention to compiler optimization next time. — Ruo Chun Zeung, Apr 30 '15 at 12:09

score 0 · Accepted Answer · answered Apr 29 '15 at 02:01

The only work here that can be done in parallel is the loop:

for(int i=0;i<10000;i++)
      temp%i; // do some CPU intensive work

gcc, even with the minimal optimisation level, will not emit any code for the temp%i; void expression (disassemble it and see), so this essentially becomes an empty loop, which will execute very fast - the execution time in the case with multiple threads running on different cores will be dominated by the cacheline containing your atomic variable ping-ponging between the different cores.

You need to make this loop actually do a significant amount of work before you'll see a speed-up.

Thanks, I got it. I tried to modify the loop, doing something useful and found out what you pointed out: the atomic variable is ping-ponging between cores. With each thread competing to modify this atomic variable results in frequent invalidation of other threads' cache value. I'll try to use some more synchronization methods. — Ruo Chun Zeung, May 05 '15 at 19:53

Why this simple program on shared variable does not scale? (no lock)

1 Answers1