
While benchmarking BLAS matrix operations in C on a CPU with hyperthreading, I observed that the runtime of the functions almost exactly doubled when hyperthreading was in use. I had expected some kind of speedup from out-of-order execution or other optimizations.

I use gettimeofday to measure the runtime. To evaluate this observation, I would like to know whether gettimeofday is reliable in a hyperthreading environment (Debian Linux, 32-bit), or whether my expectations are simply wrong.

Update: I forgot to mention that I run the benchmark application twice in parallel, with the affinity of each instance set to one hyperthreading core. For example, gemm is run twice at the same time.
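For reference, pinning a process to a single logical CPU on Linux can be sketched as below. This is only an illustration of the setup described in the update, not the actual benchmark code; the helper names `first_allowed_cpu` and `pin_to_cpu` are hypothetical, while `sched_getaffinity` and `sched_setaffinity` are the glibc calls.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t, the CPU_* macros and sched_setaffinity */
#endif
#include <sched.h>

/* Hypothetical helper: lowest-numbered logical CPU this process is
 * currently allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Hypothetical helper: restrict the calling process to one logical CPU.
 * Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = caller */
}
```

Each benchmark instance would call `pin_to_cpu` with a different logical CPU number before starting the timed section.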

bknux
  • If your code and data largely fit within cache (L1 especially, but maybe also in L2), which things like BLAS are designed/optimized to do, then the execution of that code will lack the majority of the pipeline stalls and bubbles within which hyperthreading schedules instructions from the other thread, which pretty much defeats hyperthreading. – twalberg Dec 12 '14 at 17:37

1 Answer


I doubt whether your use of gettimeofday() explains the discrepancy, unless, possibly, you are measuring very small time intervals.

More to the point, I would not expect enabling hyperthreading to improve the performance of single-threaded BLAS computations. A single thread uses only one processor (at a time), so the additional logical processors presented by hyperthreading do not help.

A well-tuned BLAS makes good use of the CPU's data cache to reduce memory access time. That doesn't help much if the needed data are evicted from the cache, however, as is likely to happen when a different process is executed by the other logical processor of the same physical CPU. Even on a lightly-loaded system, there is probably enough work to do that the OS will have a process scheduled at all times on every available (logical) processor.
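To check which logical processors share a physical core, and therefore its caches, Linux exposes the sibling list in sysfs. The sketch below simply reads that file; the function name `thread_siblings` is hypothetical, and it assumes a kernel that provides the `topology/thread_siblings_list` entry.

```c
#include <stdio.h>
#include <string.h>

/* Copy the hyperthread sibling list of logical CPU `cpu` into buf,
 * in the kernel's own list format (comma-separated IDs or ranges).
 * Returns 0 on success, -1 if the sysfs entry cannot be read. */
int thread_siblings(int cpu, char *buf, size_t len)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
             cpu);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
    return 0;
}
```

If two benchmark instances are pinned to CPUs that appear in the same sibling list, they compete for one physical core's execution units and caches; pinning them to CPUs from different lists would give a comparison without that contention.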

John Bollinger
  • Sorry, I forgot to mention that I am running the BLAS functions in parallel. – bknux Dec 12 '14 at 21:59
  • You're missing the point. BLAS itself is single-threaded, so the individual computations do not benefit from additional cores being available (whether physical or logical). On the other hand, each one's cache usage is just as adversely impacted by another BLAS computation running on the same physical CPU as it is by a random unrelated computation running there. – John Bollinger Dec 12 '14 at 22:50