Comparing performance of various pthread constructs

Question

I need to compare the performance of various pthread constructs such as mutexes, semaphores, and read-write locks, along with the corresponding serial programs, by designing some experiments. The main problem is deciding how to measure the execution time of the code for the analysis.

I have read about some C functions such as clock() and gettimeofday(). From what I understand, clock() gives the processor time used by a program, in clock ticks (by subtracting the value returned at the start from the value returned at the end of the code we want to measure), while gettimeofday() gives the wall-clock time for the execution of the program.

But total CPU time does not seem a good criterion to me, since it sums the CPU time taken across all the threads running in parallel (so clock() is not suitable, in my opinion). Wall-clock time also seems unsuitable, since other processes may be running in the background, so the measured time depends on how the threads get scheduled (so gettimeofday() is not suitable either, in my opinion).

The other functions I know of do more or less the same as these two. So, is there some function I can use for my analysis, or am I wrong somewhere in my reasoning above?

How long is your execution? What is your OS? If you want to compare single-threaded and multi-threaded versions, compare real time, not CPU time. — bruno, Apr 01 '19 at 08:06
And how long is the execution time? How many CPUs/cores do you have? — bruno, Apr 01 '19 at 08:08
I have to compare it for various input sizes - say for example I have to sum an array, then I have to vary the size which could be like 10^7, 10^8, 10^9. — him, Apr 01 '19 at 08:10
To sum an array with multiple threads you do not need a mutex etc.: each thread just sums a part of the array, then you sum the intermediate sums. — bruno, Apr 01 '19 at 08:34
You should show the code from your attempts. To measure time, use either `clock_gettime()` or `__rdtsc()`. Do not forget to disable CPU frequency changes. Always use at least `-O2` with your compiler. Take several measurements and use statistical methods to remove outliers: a trimmed average, or even the minimum value, which is simpler and leads to more stable results. — Alain Merigot, Apr 01 '19 at 08:41
I didn't mean that I am using mutexes to sum an array, it was just to tell about the possible sizes of inputs to my program — him, Apr 02 '19 at 16:19

bruno · Answer 1 · 2019-04-01T11:08:52.743

I am not sure that summing an array is a good test: you do not need any mutex etc. to sum an array with multiple threads, since each thread just has to sum a dedicated part of the array, and there are a lot of memory accesses for little CPU computation. Example (the values of SZ and NTHREADS are given when compiling); the measured time is the real (monotonic) time:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

static int Arr[SZ];

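/* Sums SZ/NTHREADS elements of Arr starting at index *a and writes the
   partial sum back into *a. Assumes SZ is a multiple of NTHREADS.
   Note: the int sum can overflow for large SZ. */
void * thSum(void * a)
{
  int s = 0, i;
  int sup = *((int *) a) + SZ/NTHREADS;

  for (i = *((int *) a); i != sup; ++i)
    s += Arr[i];

  *((int *) a) = s;
  return NULL;
}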

int main()
{
  int i;

  for (i = 0; i != SZ; ++i)
    Arr[i] = rand();

  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);

  int s = 0;

  for (i = 0; i != SZ; ++i)
    s += Arr[i];

  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("mono thread : %d %lf\n", s,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);

  clock_gettime(CLOCK_MONOTONIC, &t0);

  int n[NTHREADS];
  pthread_t ths[NTHREADS];

  for (i = 0; i != NTHREADS; ++i) {
    n[i] = SZ / NTHREADS * i;
    if (pthread_create(&ths[i], NULL, thSum, &n[i])) {
      printf("cannot create thread %d\n", i);
      return -1;
    }
  }

  int s2 = 0;

  for (i = 0; i != NTHREADS; ++i) {
    pthread_join(ths[i], NULL);
    s2 += n[i];
  }

  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("%d threads : %d %lf\n", NTHREADS, s2,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);
}

Compilations and executions:

(array of 100.000.000 elements)

/tmp % gcc -DSZ=100000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035217
2 threads : 563608529 0.020407
/tmp % ./a.out
mono thread : 563608529 0.034991
2 threads : 563608529 0.022659
/tmp % gcc -DSZ=100000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035212
4 threads : 563608529 0.014234
/tmp % ./a.out
mono thread : 563608529 0.035184
4 threads : 563608529 0.014163
/tmp % gcc -DSZ=100000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035229
8 threads : 563608529 0.014971
/tmp % ./a.out
mono thread : 563608529 0.035142
8 threads : 563608529 0.016248

(array of 1.000.000.000 elements)

/tmp % gcc -DSZ=1000000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.343761
2 threads : -1471389927 0.197303
/tmp % ./a.out
mono thread : -1471389927 0.346682
2 threads : -1471389927 0.197669
/tmp % gcc -DSZ=1000000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346859
4 threads : -1471389927 0.130639
/tmp % ./a.out
mono thread : -1471389927 0.346506
4 threads : -1471389927 0.130751
/tmp % gcc -DSZ=1000000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346954
8 threads : -1471389927 0.123572
/tmp % ./a.out
mono thread : -1471389927 0.349652
8 threads : -1471389927 0.127059

As you can see, the execution time is not simply divided by the number of threads; the bottleneck is probably memory access.

You should not use `gettimeofday()` for performance measurements. Any NTP sync will ruin your measurements. — Alain Merigot, Apr 01 '19 at 09:36
@AlainMerigot there are variations in the measured execution time, but probably not because of NTP: the clocks are good enough and the corrections are small. For me, real time must be measured rather than CPU time. — bruno, Apr 01 '19 at 10:24

KamilCuk · Answer 2 · 2019-04-01T09:26:34.673

From the Linux clock_gettime man page:

   CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
          Per-process CPU-time clock (measures CPU time consumed by all
          threads in the process).

   CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
          Thread-specific CPU-time clock.

I believed clock() was implemented as clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...); according to the clock(3) man page that is the case since glibc 2.18, while glibc 2.17 and earlier implemented it using times().

So if you want to measure thread-specific CPU time, you can use clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) on GNU/Linux systems.

Never use gettimeofday or clock_gettime(CLOCK_REALTIME, ...) to measure the execution time of a program. Don't even think about it. gettimeofday is the "wall clock": you can display it on the wall in your room. If you want to measure the flow of time, forget gettimeofday.

If you want, you can even stay fully POSIX-compatible by calling pthread_getcpuclockid inside your thread and passing the clock ID it returns to clock_gettime.

edited Apr 01 '19 at 09:26

answered Apr 01 '19 at 09:21


It all depends on what you want to measure. For me it is real time, because that is the time I feel: if I need 1 min to load my program, I don't care that it needs 1 sec to execute; for me the time is 1 min 1 sec, not 1 sec ;-) – bruno Apr 01 '19 at 10:32
Then use `CLOCK_MONOTONIC`, not `gettimeofday`. `gettimeofday` is a wall clock, not an interval-measuring clock. It can jump. If you use `gettimeofday` to measure the execution of your program, don't be surprised to see a negative time interval, or a wrong one. `gettimeofday` is only for a nice-looking user clock synchronized with UTC. Once a leap second kicks in, your measurements will be wrong. Or `ntp` kicks in and synchronizes the system, and your measurements will be wrong. – KamilCuk Apr 01 '19 at 10:35
