Why does memcpy performance deteriorate when used in multiple threads?

Question

I wrote a short test program on Linux to see how memcpy performs when used in multiple threads. I didn't expect the slowdown to be so severe: with threads, execution time went from 3.8 seconds to over 2 minutes, whereas running two instances of the program concurrently took about 4.7 seconds. Why is this?

// thread example
#include <iostream>
#include <thread>
#include <cstdlib>   // rand
#include <cstring>   // memcpy
using namespace std;

void foo(/*int a[3],int b[3]*/)
{
  int a[3]={7,8,3};
  int b[3]={9,8,2};

  for(int i=0;i<100000000;i++){
    memcpy(a,b,12*(rand()&1));
    }
}


int main()
{

#ifdef THREAD

  thread threads[4];
  for (char t=0; t<4; ++t) {
    threads[t] = thread( foo );
  }

  for (auto& th : threads) th.join();            
  cout << "foo and bar completed.\n";

#else

  foo();
  foo();
  foo();
  foo();

#endif

  return 0;
}

`rand()` returns a number between `0` and `RAND_MAX`. That multiplied by 12 is most likely to overflow your `a` and `b` buffers and render your whole experiment undefined. – rodrigo, Oct 04 '16 at 18:24
@rodrigo He is bitwise ANDing the result with 1, so it will be either 1 or 0. Still UB though, as there can be integer overflow. – NathanOliver, Oct 04 '16 at 18:28
@NathanOliver: Hmmm... actually I think that multiplication has higher precedence than bitwise AND... and since `12` is even, it will always be `0`! No UB after all, but no bits copied either. – rodrigo, Oct 04 '16 at 18:30
@rodrigo `*` has higher precedence than `&`: http://en.cppreference.com/w/cpp/language/operator_precedence – NathanOliver, Oct 04 '16 at 18:31
Have you timed it without the `memcpy()`? Without the `rand()`? And what makes you think that moving a random number of bytes, even if you had calculated it correctly, should yield consistent timings? – user207421, Oct 04 '16 at 18:34

score 3 · Accepted Answer · answered Oct 04 '16 at 18:40

Your memcpy does nothing, because 12 * rand() & 1 is parsed as (12 * rand()) & 1. Since 12 is even, the product is even, its lowest bit is clear, and the result is always 0.
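To make the precedence point concrete, here is a minimal sketch comparing the expression as quoted in this answer with a fully parenthesized variant. The helper names `size_without_parens` and `size_with_parens` are made up for illustration, and `unsigned` is used instead of `int` only to sidestep the signed-overflow concern raised in the comments.

```cpp
#include <cstddef>

// The expression as quoted in the answer: `*` binds tighter than `&`,
// so this computes (12 * r) & 1. An even product has its low bit clear.
constexpr std::size_t size_without_parens(unsigned r) { return 12 * r & 1; }

// With explicit parentheses: 12 * (r & 1), which is either 0 or 12.
constexpr std::size_t size_with_parens(unsigned r) { return 12 * (r & 1); }

// Checked at compile time, so nothing here depends on a particular rand().
static_assert(size_without_parens(7) == 0, "even product AND 1 is 0");
static_assert(size_with_parens(7) == 12, "odd input selects 12 bytes");
static_assert(size_with_parens(8) == 0, "even input selects 0 bytes");
```

Whichever form the size argument takes, every loop iteration still calls rand(), so the cost of that call is part of what gets timed.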

So you are simply measuring the time of rand(), but that function uses a shared global state that may (or may not) be shared by all the threads. It looks like in your implementation it is shared and its access is synchronized, so you have heavy contention and the performance suffers.

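Try using rand_r() instead, which keeps its state in a caller-supplied variable rather than in shared global state (or use the new and improved C++ random generators):

  unsigned int r = 0;   // this thread's private generator state
  for(int i=0;i<100000000;i++){
    rand_r(&r);
  }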

On my machine, that reduces the multithreaded runtime from 30s to 0.7s (the single-threaded run took 2.2s). Naturally, this experiment says nothing about memcpy(), but it says something about shared global state...
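For the C++ random generators mentioned above, the same idea can be sketched by giving every thread its own engine. This is only an illustration, not a benchmark: the function names `draw_local` and `run_workers`, the choice of `std::mt19937`, and the seed values are all assumptions made for this example.

```cpp
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Draws `iterations` numbers from an engine owned by the caller's thread
// and returns how many of them were odd. No generator state is shared.
std::uint64_t draw_local(unsigned seed, int iterations) {
    std::mt19937 engine(seed);  // private to the calling thread
    std::uint64_t odd_count = 0;
    for (int i = 0; i < iterations; ++i) {
        odd_count += engine() & 1u;
    }
    return odd_count;
}

// Starts `n_threads` workers, each with its own seed (hypothetical values),
// and returns one result per worker after all of them have finished.
std::vector<std::uint64_t> run_workers(int n_threads, int iterations) {
    std::vector<std::uint64_t> results(n_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&results, t, iterations] {
            // Each worker writes only its own slot, once, at the end.
            results[t] = draw_local(1000u + t, iterations);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Calling `run_workers(4, 100000000)` would mirror the four threads and the loop count of the question's program; on Linux with GCC or Clang the program typically needs to be built with `-pthread`.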

rodrigo

He should avoid `rand()` completely if he is timing something. Otherwise he is just timing things that take different times by definition, so any complaint that they do so is meaningless. The whole methodology is invalid. – user207421 Oct 04 '16 at 18:44
@EJP: Sure, it is invalid to measure what the OP wanted. But since the `memcpy()` is actually a no-op, the code is actually valid to measure the time of `rand()` in a multithreaded environment. I, for one, find the performance hit quite interesting. – rodrigo Oct 04 '16 at 18:52
Interesting but meaningless. It's meaningful if he's timing `rand()`, or method invocations, or local array initialization, or multiplication by 12 and AND-ing with 1. The one thing it isn't meaningful for is timing `memcpy()`, which is what the question is stated to be about. – user207421 Oct 04 '16 at 18:57
