I was thinking of having a per-cache (or per-core, assuming every core has its own cache on the target architecture) mutex and data in order to optimize a concurrent algorithm. The goal is to minimize cache flushes and misses across cores, so that other threads suffer fewer misses and the system gets more concurrency and throughput. What is the usual strategy for achieving something like this in C++? And how can I detect which core a thread is running on, so that it can use the mutex and data stored in the cache closest to it?
I have heard of people doing things like this in concurrent algorithms, but I have no idea where to start implementing it.
For example, the Linux man pages document getcpu(2) (http://man7.org/linux/man-pages/man2/getcpu.2.html), which leads me to think that these sorts of optimizations are done in practice.
(This might be too broad a question. I am willing to move it to another site, change the tags, or drop it entirely if people think so; let me know.)