
Profiling my code, I see a lot of cache misses and would like to know whether there is a way to improve the situation. Optimization is not really needed; I'm more curious whether there exist general approaches to this problem (this is a follow-up question).

// class to compute stuff
class A {
    double compute();
    ...
    // depends on other objects
    std::vector<A*> dependencies;
};

I have a container class that stores pointers to all created objects of class A. I do not store copies, as I want shared access. I was using shared_ptr before, but since individual As are meaningless without the container, raw pointers are fine.

class Container {
    ...
    void compute_all();
    std::vector<A*> objects;
    ...
};

The vector objects is insertion-sorted so that a full evaluation can be done by simply iterating and calling A::compute(); by the time an object is reached, all of its dependencies have already been resolved.
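
For illustration, compute_all() then essentially boils down to a single forward pass (a simplified sketch consistent with the description above; each compute() may in turn walk its own dependencies):

void Container::compute_all() {
    // objects is ordered so that everything an element depends on
    // appears at an earlier index and has already been computed
    for (A* a : objects)
        a->compute();
}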

With a_i denoting objects of class A, the evaluation might look like this:

a_1 => a_2 => a_3 --> a_2 --> a_1 => a_4 => ....

where => denotes iteration in Container and --> iteration over A::dependencies

Moreover, the Container class is created only once and compute_all() is called many times, so rearranging the whole structure after creation is an option and wouldn't harm efficiency much.

Now to the observations/questions:

  1. Obviously, iterating over Container::objects is cache-efficient, but accessing the pointees definitely is not.

  2. Moreover, each object of type A has to iterate over A::dependencies, which again can produce cache misses.

Would it help to create a separate vector<A*> from all needed objects, in evaluation order, such that dependencies in A are inserted as copies?

Something like this:

a_1 => a_2 => a_3 => a_2_c => a_1_c => a_4 => ....

where a_i_c are copies of a_i.
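
For concreteness, a rough sketch of how such a flattened structure could be built once after construction (a hypothetical interpretation that stores the copies by value for locality, and assumes compute() can safely operate on copies):

// Record every visit, including dependency re-visits, as a copy in
// evaluation order, so iteration becomes a purely sequential scan.
std::vector<A> flat;
for (A* a : objects) {
    flat.push_back(*a);            // a_i
    for (A* dep : a->dependencies)
        flat.push_back(*dep);      // the a_i_c copies
}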

Thanks for your help and sorry if this question is confusing, but I find it rather difficult to extrapolate from simple examples to large applications.


2 Answers


Unfortunately, I'm not sure if I'm understanding your question correctly, but I'll try to answer.

Cache misses occur when the processor requires data that is scattered all over memory.

One very common way of increasing cache hits is just organizing your data so that everything that is accessed sequentially is in the same region of memory. Judging by your explanation, I think this is most likely your problem; your A objects are scattered all over the place.
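
For instance (a hypothetical sketch, not necessarily your design), the As could be stored by value in one contiguous vector, with dependencies expressed as indices into it rather than as pointers:

#include <cstddef>
#include <vector>

// All As live contiguously; deps[i] holds the indices of the
// objects that objects[i] depends on.
struct Storage {
    std::vector<A> objects;
    std::vector<std::vector<std::size_t>> deps;
};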

If you're just calling regular new every single time you need to allocate an A, you'll probably end up with all of your A objects being scattered.

You can create a custom allocator for objects that will be created many times and accessed sequentially. This custom allocator could allocate a large number of objects and hand them out as requested. This may be similar to what you meant by reordering your data.

It can take a bit of work to implement this, however, because you have to consider cases such as what happens when it runs out of objects, how it knows which objects have been handed out, and so on.

// This example is very simple. Instead of using new to create an Object,
// the code can just call Allocate() and use the pointer returned.
// This ensures that all Object instances reside in the same region of memory.
struct CustomAllocator {
    CustomAllocator() : nextObject(cache) { }

    Object* Allocate() {
        // Note: a real implementation would check that the cache
        // hasn't been exhausted before handing out the next slot.
        return nextObject++;
    }

    Object* nextObject;
    Object cache[1024];
};

Another method involves caching operations that work on sequential data, but aren't performed sequentially. I think this is what you meant by having a separate vector.

However, it's important to understand that your CPU doesn't just keep one section of memory in cache at a time. It keeps multiple sections of memory cached.

If you're jumping back and forth between operations on data in one section and operations on data in another section, this most likely will not cause many cache misses; your CPU can and should keep both sections cached at the same time.

If you're jumping between operations on 50 different sets of data, you'll probably encounter many cache misses. In this scenario, caching operations would be beneficial.

In your case, I don't think caching operations will give you much benefit. Ensuring that all of your A objects reside in the same section of memory, however, probably will.

Another thing to consider is threading, but this can get pretty complicated. If your thread is doing a lot of context switches, you may encounter a lot of cache misses.


+1 for profiling first :)

While using a custom allocator can be the correct solution, I'd certainly recommend two things first:

  • keep a reference/pointer to the entire vector of A instead of a vector of A*:


class Container {
    ...
    void compute_all();
    std::vector<A>* objects;
    ...
};
  • Use a standard library with custom allocators (I think Boost has some good ones; EASTL is centered around the very concept); see the sketch below.
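
A rough sketch with Boost.Pool (my assumption; any pooling allocator works): an object_pool carves A instances out of large contiguous blocks instead of scattering them across the heap with individual new calls.

#include <boost/pool/object_pool.hpp>

boost::object_pool<A> pool;

A* make_a() {
    return pool.construct();  // placement-constructs an A inside the pool
}
// The pool releases all of its memory at once when destroyed, which fits
// the "As are meaningless without the container" ownership model.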

$0.02
