20

How fast is accessing thread-local variables in Linux? From the code generated by the gcc compiler, I can see that it uses the fs segment register. So apparently, accessing a thread-local variable should not cost extra cycles.

However, I keep reading horror stories about the slowness of thread-local variable access. How come? Sure, some compilers take a different approach than using the fs segment register, but is accessing a thread-local variable through the fs segment register slow too?

pythonic
  • what's going on behind the scenes: http://www.akkadia.org/drepper/tls.pdf .. does anybody feel motivated to read this and summarize it in a short answer? :D – Karoly Horvath Mar 28 '12 at 15:53
  • The "horror stories" are probably from TSS (Thread-Specific Storage) via pthread_setspecific. TSS is slower than TLS, but if done properly not by a whole lot. – Zan Lynx Mar 28 '12 at 19:57
  • I could give you a horror story of the slowness of a _non_ thread-local variable (a simple integer counter), which was modified through several threads and slowed the system down to a crawl because of cache snooping. Making it thread-local and doing a summation of all thread-locals at the end gave me a speedup of a factor of 100 or similar. – Gunther Piez Mar 29 '12 at 08:53
  • drhirsch: Damn man! The tool I'm working on had exactly the same problem and I solved it exactly like you did, that is, I used thread-local variables instead :)! Cheers! – pythonic Mar 30 '12 at 10:43

2 Answers

19

However, I keep on reading horror stories about the slowness of thread local variable access. How come?

Let me demonstrate the slowness of thread-local variables on Linux x86_64 with an example I have taken from http://software.intel.com/en-us/blogs/2011/05/02/the-hidden-performance-cost-of-accessing-thread-local-variables.

  1. No __thread variable, no slowness.

    I will use the performance of this test as a base.

        #include <stdio.h>
        #include <math.h>
    
        double tlvar;
        //following line is needed so get_value() is not inlined by compiler
        double get_value() __attribute__ ((noinline));
        double get_value()
        {
          return tlvar;
        }
        int main()
        {
          int i;
          double f=0.0;
          tlvar = 1.0;
          for(i=0; i<1000000000; i++)
          {
             f += sqrt(get_value());
          }
          printf("f = %f\n", f);
          return 1;
        }
    

    This is the assembler code of get_value():

    Dump of assembler code for function get_value:
    => 0x0000000000400560 <+0>:     movsd  0x200478(%rip),%xmm0        # 0x6009e0 <tlvar>
       0x0000000000400568 <+8>:     retq
    End of assembler dump.
    

    This is how fast it runs:

    $ time ./inet_test_no_thread
    f = 1000000000.000000
    
    real    0m5.169s
    user    0m5.137s
    sys     0m0.002s
    
  2. There is a __thread variable in the executable (not in a shared library), still no slowness.

    #include <stdio.h>
    #include <math.h>
    
    __thread double tlvar;
    //following line is needed so get_value() is not inlined by compiler
    double get_value() __attribute__ ((noinline));
    double get_value()
    {
      return tlvar;
    }
    
    int main()
    {
      int i;
      double f=0.0;
    
      tlvar = 1.0;
      for(i=0; i<1000000000; i++)
      {
        f += sqrt(get_value());
      }
      printf("f = %f\n", f);
      return 1;
    }
    

    This is the assembler code of get_value():

    (gdb) disassemble get_value
    Dump of assembler code for function get_value:
    => 0x0000000000400590 <+0>:     movsd  %fs:0xfffffffffffffff8,%xmm0
       0x000000000040059a <+10>:    retq
    End of assembler dump.
    

    This is how fast it runs:

    $ time ./inet_test
    f = 1000000000.000000
    
    real    0m5.232s
    user    0m5.158s
    sys     0m0.007s
    

    So it is quite obvious that when a __thread variable is in the executable, it is as fast as an ordinary global variable.

  3. There is a __thread variable in a shared library; there is slowness.

    Executable:

    $ cat inet_test_main.c
    #include <stdio.h>
    #include <math.h>
    int test();
    
    int main()
    {
       test();
       return 1;
    }
    

    Shared library:

    $ cat inet_test_lib.c
    #include <stdio.h>
    #include <math.h>
    
    static __thread double tlvar;
    //following line is needed so get_value() is not inlined by compiler
    double get_value() __attribute__ ((noinline));
    double get_value()
    {
      return tlvar;
    }
    
    int test()
    {
      int i;
      double f=0.0;
      tlvar = 1.0;
      for(i=0; i<1000000000; i++)
      {
        f += sqrt(get_value());
      }
      printf("f = %f\n", f);
      return 1;
    }
    

    This is the assembler code of get_value(). See how different it is - it calls __tls_get_addr():

    Dump of assembler code for function get_value:
    => 0x00007ffff7dfc6d0 <+0>:     lea    0x200329(%rip),%rdi        # 0x7ffff7ffca00
       0x00007ffff7dfc6d7 <+7>:     callq  0x7ffff7dfc5c8 <__tls_get_addr@plt>
       0x00007ffff7dfc6dc <+12>:    movsd  0x0(%rax),%xmm0
       0x00007ffff7dfc6e4 <+20>:    retq
    End of assembler dump.
    
    (gdb) disas __tls_get_addr
    Dump of assembler code for function __tls_get_addr:
       0x0000003c40a114d0 <+0>:     push   %rbx
       0x0000003c40a114d1 <+1>:     mov    %rdi,%rbx
    => 0x0000003c40a114d4 <+4>:     mov    %fs:0x8,%rdi
       0x0000003c40a114dd <+13>:    mov    0x20fa74(%rip),%rax        # 0x3c40c20f58 <_rtld_local+3928>
       0x0000003c40a114e4 <+20>:    cmp    %rax,(%rdi)
       0x0000003c40a114e7 <+23>:    jne    0x3c40a11505 <__tls_get_addr+53>
       0x0000003c40a114e9 <+25>:    xor    %esi,%esi
       0x0000003c40a114eb <+27>:    mov    (%rbx),%rdx
       0x0000003c40a114ee <+30>:    mov    %rdx,%rax
       0x0000003c40a114f1 <+33>:    shl    $0x4,%rax
       0x0000003c40a114f5 <+37>:    mov    (%rax,%rdi,1),%rax
       0x0000003c40a114f9 <+41>:    cmp    $0xffffffffffffffff,%rax
       0x0000003c40a114fd <+45>:    je     0x3c40a1151b <__tls_get_addr+75>
       0x0000003c40a114ff <+47>:    add    0x8(%rbx),%rax
       0x0000003c40a11503 <+51>:    pop    %rbx
       0x0000003c40a11504 <+52>:    retq
       0x0000003c40a11505 <+53>:    mov    (%rbx),%rdi
       0x0000003c40a11508 <+56>:    callq  0x3c40a11200 <_dl_update_slotinfo>
       0x0000003c40a1150d <+61>:    mov    %rax,%rsi
       0x0000003c40a11510 <+64>:    mov    %fs:0x8,%rdi
       0x0000003c40a11519 <+73>:    jmp    0x3c40a114eb <__tls_get_addr+27>
       0x0000003c40a1151b <+75>:    callq  0x3c40a11000 <tls_get_addr_tail>
       0x0000003c40a11520 <+80>:    jmp    0x3c40a114ff <__tls_get_addr+47>
    End of assembler dump.
    

    It runs almost twice as slow:

    $ time ./inet_test_main
    f = 1000000000.000000
    
    real    0m9.978s
    user    0m9.906s
    sys     0m0.004s
    

    And finally, this is what perf reports: __tls_get_addr takes 21% of CPU time:

    $ perf report --stdio
    #
    # Events: 10K cpu-clock
    #
    # Overhead         Command        Shared Object              Symbol
    # ........  ..............  ...................  ..................
    #
        58.05%  inet_test_main  libinet_test_lib.so  [.] test
        21.15%  inet_test_main  ld-2.12.so           [.] __tls_get_addr
        10.69%  inet_test_main  libinet_test_lib.so  [.] get_value
         5.07%  inet_test_main  libinet_test_lib.so  [.] get_value@plt
         4.82%  inet_test_main  libinet_test_lib.so  [.] __tls_get_addr@plt
         0.23%  inet_test_main  [kernel.kallsyms]    [k] 0xffffffffa0165b75
    

So, as you can see, when a thread-local variable is in a shared library (declared static and used only within the shared library) it is rather slow. If a thread-local variable in a shared library is accessed rarely, it is not a problem for performance. If it is accessed as often as in this test, the overhead will be significant.

The document http://www.akkadia.org/drepper/tls.pdf which is mentioned in the comments describes four possible TLS access models. Frankly, I don't understand when the "initial exec" TLS model is used, but as for the other three models, calling __tls_get_addr() can be avoided only when the __thread variable is in an executable and is accessed from the executable.

  • +1 for all this testing. Great. However, five nanoseconds per operation is not what I would call really slow. It's in the same order as a function call, so unless thread-local variables are virtually the only thing you do, it should never be an issue. Thread synchronization is generally much more expensive. And if you can avoid that by using thread-local storage, you have a huge win - shared library or not. – cmaster - reinstate monica Aug 25 '14 at 16:48
  • You can use -ftls-model=initial-exec or __attribute__((tls_model("initial-exec"))) in a shared library, but then you must be very careful. It breaks dlopen, and the load order of shared objects becomes important, because an ELF object with the STATIC_TLS flag set may fail to load if too many static or dynamic TLS objects have already been loaded (i.e. you should load static TLS objects first). – Pauli Nieminen Jun 24 '18 at 09:22
10

How fast is accessing thread-local variables in Linux

It depends on a lot of things.

Some processors (i*86) have a special segment register (fs, or gs in x86_64 mode). Other processors do not (but they will usually have a register reserved for accessing the current thread, and TLS is easy to find using that dedicated register).

On i*86, using fs, the access is almost as fast as direct memory access.

I keep reading horror stories about the slowness of thread-local variable access

It would have helped if you provided links to some such horror stories. Without the links, it's impossible to tell whether their authors know what they are talking about.

Employed Russian
  • Horror stories? No problem: I've worked on an embedded MIPS platform where each access to thread-local storage resulted in a very slow kernel call. You could do roughly 8000 TLS accesses per second on that platform. – Nils Pipenbrinck Aug 25 '14 at 08:09