
I've implemented a lock-free stack based on the example from the book 'C++ Concurrency in Action'. I wanted to benchmark it and compare it against other lock-free stacks, e.g. the one from boost::lockfree. I used the Google Benchmark framework to conduct those tests, measuring the time of a single operation under different levels of contention (by operation I mean a push or pop, invoked in random order).
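For reference, a minimal sketch of the kind of Treiber-style stack the book presents (this is not my exact code; in particular, this version simply leaks popped nodes to sidestep the memory-reclamation problem, which the book solves with hazard pointers or reference counting):

```cpp
#include <atomic>
#include <optional>
#include <utility>

// Minimal Treiber-style lock-free stack sketch.
// NOTE: popped nodes are deliberately leaked; safe reclamation
// (hazard pointers / ref counting, as in the book) is omitted.
template <typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};

public:
    void push(T value) {
        Node* node = new Node{std::move(value),
                              head.load(std::memory_order_relaxed)};
        // CAS loop: retry until `node` is installed as the new head.
        // On failure, compare_exchange_weak reloads head into node->next.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }

    std::optional<T> pop() {
        Node* old = head.load(std::memory_order_acquire);
        // CAS loop: retry until we detach the current head,
        // or observe an empty stack (old == nullptr).
        while (old && !head.compare_exchange_weak(old, old->next,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed)) {}
        if (!old) return std::nullopt;
        return std::move(old->value);  // node itself is leaked (see above)
    }
};
```

Both operations are a single CAS retry loop on `head`, so under contention every failed CAS costs a cache-line bounce on the head pointer.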

Run on (8 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
----------------------------------------------------------------------------------
Benchmark                                           Time           CPU Iterations
----------------------------------------------------------------------------------
    BM_lockFreeStack/real_time/threads:1              136 ns        136 ns    5145339
    BM_lockFreeStack/real_time/threads:2              184 ns        367 ns    3785648
    BM_lockFreeStack/real_time/threads:4              207 ns        820 ns    3361952
    BM_lockFreeStack/real_time/threads:8              209 ns       1639 ns    3387024
    BM_lockFreeStack/real_time/threads:16             167 ns        957 ns    4269504
    BM_lockFreeStack/real_time/threads:32             150 ns        590 ns    4866592
    BM_boostLockFreeStack/real_time/threads:1          66 ns         66 ns   10510435
    BM_boostLockFreeStack/real_time/threads:2         133 ns        265 ns    5713306
    BM_boostLockFreeStack/real_time/threads:4         122 ns        475 ns    5809292
    BM_boostLockFreeStack/real_time/threads:8         128 ns        944 ns    5432072
    BM_boostLockFreeStack/real_time/threads:16        129 ns        989 ns    5461120
    BM_boostLockFreeStack/real_time/threads:32        129 ns       1017 ns    5447776

As you can see, I used a processor with 8 hardware threads. What is surprising to me are the results for 16/32 threads (lockFreeStack), where the average operation time is shorter than for 2/4/8 threads. These results are consistent every time I run the tests.

Is there any logical explanation for this behaviour?

Karol K
  • If the full/empty or other failure cases are faster, are they happening more often when you have more SW threads than the HW can run at once, so some are blocked? Can a partially-complete push or pop by a descheduled thread leave it in an always-fails state for other threads? You should at least link to your code or algo design, otherwise we can't analyze the behaviour any more than this. – Peter Cordes Jun 25 '18 at 01:14
  • What actual hardware do you have? From the caches, looks like a quad-core Intel with hyperthreading, but is it Sandybridge or Skylake? Probably doesn't matter; I'm not aware of any changes in hardware arbitration for `lock`ed operations within SnB-family, so there probably aren't any significant changes. – Peter Cordes Jun 25 '18 at 01:17

0 Answers