
I was thinking of having a per-cache (or per-core, assuming every core has its own cache on the target architecture) mutex and data block to optimize a concurrent algorithm, the goal being to minimize cache flushing and misses across cores: reduce cache misses for other threads and allow more concurrency and better performance in the system. What is the usual strategy for achieving something like this in C++? How can I detect which cache a thread is going to access, so that the mutex and per-thread data can be stored in the cache closest to that thread?
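Something along these lines is what I have in mind. This is only a rough sketch under my own assumptions: the `Shard` name is made up, the 64-byte fallback is just a typical x86-64 line size, and `std::hardware_destructive_interference_size` is C++17 but not yet shipped by every standard library:

```cpp
#include <cstddef>
#include <mutex>
#include <new>      // std::hardware_destructive_interference_size (C++17)
#include <thread>
#include <vector>

// Fall back to a common cache-line size if the library constant is missing.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kLineSize = 64;  // assumption: typical x86-64 line size
#endif

// One mutex plus its data per core, padded so that each shard occupies its
// own cache line and contention on one shard does not drag other shards'
// lines back and forth between cores (no false sharing between shards).
struct alignas(kLineSize) Shard {
    std::mutex mtx;
    long value = 0;
};

int main() {
    std::vector<Shard> shards(std::thread::hardware_concurrency());
    std::lock_guard<std::mutex> lock(shards[0].mtx);  // a thread works on "its" shard
    ++shards[0].value;
}
```

The part I am missing is how a thread is supposed to pick "its" shard.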

I have heard of people doing things like this with concurrent algorithms, but I have no idea where to start implementing something like this.

For example, I see getcpu(2) in the Linux man pages (http://man7.org/linux/man-pages/man2/getcpu.2.html), which leads me to think that these sorts of optimizations are done in practice.
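For instance, something like this (again only a sketch, reusing the `Shard` vector from above; `sched_getcpu` is the glibc wrapper around that system call, and its answer is purely advisory since the scheduler can migrate the thread right after the call returns):

```cpp
#include <sched.h>   // sched_getcpu() -- glibc wrapper around getcpu(2)

// Pick the shard "belonging" to the CPU this thread is currently running on.
// Advisory only: the thread may be moved to another CPU at any moment.
Shard& local_shard(std::vector<Shard>& shards) {
    int cpu = sched_getcpu();                        // returns -1 on error
    std::size_t idx = (cpu < 0) ? 0 : static_cast<std::size_t>(cpu);
    return shards[idx % shards.size()];              // modulo as a safety guard
}
```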

(This might be too broad a question. I am willing to move it to another site, change the tags, or drop the question entirely if people think so; let me know.)

Curious
  • This doesn't sound like something application code has any access to. Cache lines are very low level, transparent to user code. – Barmar Dec 23 '17 at 05:52
  • @Barmar I know that C++ recently added constants that expose the L1 cache-line size on the target architecture. Since that is possible, I feel like this should be possible as well? http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size – Curious Dec 23 '17 at 05:55
  • There are some reader writer mutex implementations out there that talk about trying to reduce cache misses by having "cache local" storage etc. So it must be possible in some way or form? – Curious Dec 23 '17 at 05:56
  • The main answer is to maximize memory locality, that is, to ensure that items that are going to be accessed close in time are located close in memory (preferably on the same cache line), and to minimize cross-core writes to the same line. Beyond that, chips vary so widely that you would need to read about the particular target processor (note I do mean processor; this changes even for different processors in the same family). – SoronelHaetir Dec 23 '17 at 05:57
  • Those constants are very crude. All they tell you is that if two addresses are more than that far apart, they're in different cache lines. But unless you implement your own memory management, there's not much you can do with them. – Barmar Dec 23 '17 at 05:58
  • Forcing things to be in the same cache line is harder. Just because they're close together doesn't mean they're in the same cache line, because they could be on opposite sides of the border between cache lines. – Barmar Dec 23 '17 at 06:00
  • @SoronelHaetir Right, do you think what you said is in line with what I asked? I thought what I wanted was similar. If you think I was asking to do something completely orthogonal, I should probably edit my question. What do you think? – Curious Dec 23 '17 at 06:00
  • @Barmar I had heard that either Facebook or Google has a mutex class that does something like this, but I have not been able to figure out how one even attempts to minimize cross-core writes in the presence of many threads. – Curious Dec 23 '17 at 06:01
  • "What is the usual strategy" -- this isn't something programmers generally worry about at all, so there's no usual strategy. – Barmar Dec 23 '17 at 06:01
  • @Barmar what I meant is: what is the "usual" (or rather, "any at all") strategy for programmers who have found themselves implementing an optimization in a concurrent algorithm to minimize cache flushing? – Curious Dec 23 '17 at 06:02
  • As @SoronelHaetir said, just try to maximize locality of data, and let the hardware and OS deal with the rest. This means you may have to implement some of your own memory management: instead of allocating lots of little objects, allocate an array and carve it up into sub-objects. – Barmar Dec 23 '17 at 06:04
  • @Barmar that is somewhat along the lines of what I was thinking. I can get far enough to have `n` sub-objects where `n` is the number of cores the system has (`std::thread::hardware_concurrency`), but I can't get any further. How would one detect which sub-object a thread should access, given that I don't want that thread accessing data on another core (and causing a cache flush)? – Curious Dec 23 '17 at 06:06
  • What you do is allocate an array for each thread, and it uses that memory for its per-thread data. But you'll have to design all your own data structures for this, you won't be able to use things like `std::vector`. It will be more like C than C++. – Barmar Dec 23 '17 at 06:11
  • If you just use local variables, they'll be allocated in the stack, which is already a per-thread block of memory. – Barmar Dec 23 '17 at 06:13
  • @Barmar hmmm, but thread-local storage doesn't really buy me what I was looking for, right? I wanted different threads to be able to run, and during "pause points" where some sort of scheduling is needed (for example, deciding which task to run next) I wanted those threads to look for threads that share a cache with the current one and have their data close by, or something similar to that. – Curious Dec 23 '17 at 06:13
  • As I mentioned above, I don't think there's any way to tell which memory is in the same cache line. – Barmar Dec 23 '17 at 06:15
  • @Barmar I added something to the question that might serve as the motivation behind wanting something like this – Curious Dec 23 '17 at 06:45
  • I'm not sure how that's relevant. It's just an informative call, it doesn't let you control anything, and the information can change as soon as you return. – Barmar Dec 23 '17 at 06:55
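A rough sketch of the "allocate an array per thread and carve it up" idea from the comments above. The `Arena` name and interface are hypothetical and it reuses the `kLineSize` constant from the first sketch; the point is only that each thread draws its objects from its own line-aligned block, so they never share a cache line with another thread's data:

```cpp
#include <cstddef>
#include <new>      // ::operator new/delete with std::align_val_t (C++17)

// Hypothetical per-thread arena: one aligned block per thread, carved into
// line-sized chunks. Objects handed out here never straddle or share a cache
// line with objects handed out by another thread's arena.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buf_(static_cast<std::byte*>(::operator new(bytes, std::align_val_t{kLineSize}))),
          size_(bytes) {}
    ~Arena() { ::operator delete(buf_, std::align_val_t{kLineSize}); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Hand out cache-line-aligned chunks; no per-object free, reset() instead.
    void* allocate(std::size_t bytes) {
        std::size_t rounded = (bytes + kLineSize - 1) / kLineSize * kLineSize;
        if (used_ + rounded > size_) return nullptr;   // out of space
        void* p = buf_ + used_;
        used_ += rounded;
        return p;
    }
    void reset() { used_ = 0; }

private:
    std::byte* buf_;
    std::size_t size_;
    std::size_t used_ = 0;
};
```

Each worker thread would own one `Arena` for its hot data; going further and pinning threads to cores (e.g. with `pthread_setaffinity_np` on Linux) would keep a "one arena per core" mapping stable, but that is a separate decision.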

0 Answers