13

I'm currently profiling an application with performance problems using Valgrind's "Callgrind". In looking at the profiling data, it appears that a good 25% of processing time is being spent inside of boost::detail::get_tss_data in an application whose primary purpose is physics simulation and visualization.

get_tss_data is apparently called by thread_specific_ptr::get

Does anyone see this as expected? Does it generally imply something else specific?

Edit:

My platform is: Linux-2.6.32, x86, GCC 4.4.3, libc6-2.11.1/libpthread-2.11.1

Catskul
  • Which host platform, OS, compiler and settings are in use? For example, thread local storage could be implemented with a quickly accessible special register (e.g. FS/GS selectors on x86-32), or a slow system call (platforms where these tricks haven't yet been implemented). So... what is yours? – John Ripley Mar 23 '11 at 04:04
  • I'm using GCC 4.4.3, libc6-2.11.1/libpthread-2.11.1 on Linux x86 – Catskul Mar 23 '11 at 18:25
  • We need to know what was the majority caller of boost::detail::get_tss_data to get a clearer picture of why so much time is being spent here. It would help if you can paste the last piece of your own code in the call graph before it goes there. – John Ripley Mar 26 '11 at 01:50

2 Answers

4

thread_specific_ptr uses pthread_setspecific/pthread_getspecific on POSIX systems, which is not the fastest option available.

If you are on a POSIX system and using GCC (or a compatible compiler), you can use the __thread storage specifier instead. However, it can only be used with initializers that are constant expressions (see GCC's documentation on __thread).

For Windows, a similar specifier is __declspec(thread).
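
As a rough sketch of what the specifier looks like in use: the names TLS_SPEC, tls_counter, bump_counter and count_on_new_thread below are made up for illustration, and std::thread assumes a C++11 compiler. This is not what thread_specific_ptr does internally, just the compiler-level alternative described above.

```cpp
#include <thread>

// Pick the compiler-specific thread-local storage specifier.
#if defined(_MSC_VER)
#  define TLS_SPEC __declspec(thread)
#else
#  define TLS_SPEC __thread   // GCC / Clang extension
#endif

// Hypothetical per-thread counter. The initializer must be a constant
// expression, which is the restriction mentioned above.
static TLS_SPEC int tls_counter = 0;

// Increments the calling thread's own copy and returns the new value.
int bump_counter() {
    return ++tls_counter;
}

// Runs n increments on a fresh thread and returns that thread's final count.
int count_on_new_thread(int n) {
    int result = 0;
    std::thread worker([&result, n] {
        for (int i = 0; i < n; ++i) {
            result = bump_counter();
        }
    });
    worker.join();
    return result;
}
```

Each thread sees its own copy of tls_counter, so the count on the new thread starts from zero no matter how often the calling thread has already called bump_counter(). With GCC you would typically build this with -pthread.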

ipapadop
  • 2
    Chances are that even things like __thread will make system calls and will possibly be a bit slow, so try caching thread local storage as much as possible in normal stack or heap variables. – doron Mar 23 '11 at 12:15
  • With C++11 you can now use the `thread_local` specifier on most compilers to have the same semantics as with `boost::thread_specific_ptr` – ipapadop Feb 25 '15 at 21:00
1

Obtaining thread local data will most probably involve a system call. System calls jump to an interrupt vector and then have to read kernel memory. All this kills the cache.

For this reason, reading thread local data can take much longer than a normal variable read. It may therefore be a good idea to cache thread local data in a local variable and not make frequent accesses to thread local storage.
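
A minimal sketch of that caching pattern, assuming C++11: ThreadState, get_thread_state and the two sum functions are hypothetical names, with get_thread_state standing in for a call such as thread_specific_ptr::get(). The lookups field exists only to make the number of accessor calls visible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread state, standing in for whatever the application
// keeps behind its thread-specific pointer.
struct ThreadState {
    double accumulated = 0.0;
    std::size_t lookups = 0;   // how many times the accessor was called
};

// Stand-in for thread_specific_ptr::get(): each call pays the lookup cost.
ThreadState* get_thread_state() {
    thread_local ThreadState state;
    ++state.lookups;
    return &state;
}

// Pattern to avoid: one thread-local lookup per element.
double sum_with_repeated_lookups(const std::vector<double>& values) {
    get_thread_state()->accumulated = 0.0;
    for (double v : values) {
        get_thread_state()->accumulated += v;
    }
    return get_thread_state()->accumulated;
}

// Suggested pattern: look the pointer up once and reuse the local copy.
double sum_with_cached_pointer(const std::vector<double>& values) {
    ThreadState* state = get_thread_state();  // single lookup
    state->accumulated = 0.0;
    for (double v : values) {
        state->accumulated += v;
    }
    return state->accumulated;
}
```

Both functions return the same sum, but the cached version calls the accessor once per invocation while the other calls it once per element plus twice more. Whether this matters in practice depends on how expensive the real accessor is on your platform.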

doron
  • 3
    Accessing thread local storage should not be expensive on i386, x86-64 or even armv6/7 with a modern gcc/glibc. It's just an offset from a special register that's part of the thread state and maintained by the kernel across context switches. It's even part of the ABI that there are special registers to make this fast without needing a syscall. – John Ripley Mar 23 '11 at 04:09
  • True, you can reserve a register for a TLS offset, but this changes the ABI and is incompatible with the non-TLS ABI (since it now becomes necessary to preserve the TLS register before calling functions). I am not sure of x86 ABIs, but ARM's EABI does NOT require preservation of a TLS register (see http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/IHI0042D_aapcs.pdf). I may be wrong, but as far as I remember gcc implements __thread using a syscall. – doron Mar 23 '11 at 10:01
  • 1
    ARM has 'TPIDRURO' which is accessed via cp13, which is single issue on all implementations I know about. It doesn't need preserving due to its location. I don't know about Linux but iOS uses it. I would be surprised if a modern Linux distro was using the non-TLS ABI - precisely because of this performance issue - but stranger things have happened. – John Ripley Mar 23 '11 at 16:44
  • @John, this was only added in ARMv7. It is there for thread identification information but I guess it can be used for a base address for TLS – doron Mar 24 '11 at 00:19
  • 2
    Just checked eglibc-2.11 source. Both i386 and x86_64 use FS/GS selectors to get thread-specific data. There are no syscalls involved. ARM-specific stuff seems to be missing, though - but that's not the arch in question here. – John Ripley Mar 24 '11 at 02:16