
I want to learn how to write better code that takes advantage of the CPU's cache. Working with contiguous memory seems to be the ideal situation. That being said, I'm curious if there are similar improvements that can be made with non-contiguous memory, but with an array of pointers to follow, like:

struct Position {
    int32_t x, y, z;
};
...
std::vector<Position*> posPointers;
...
void updatePosition() {
    for (uint32_t i = 0; i < posPointers.size(); i++) {
        Position& nextPos = *posPointers[i];
        nextPos.x++;
        nextPos.y++;
        nextPos.z++;
    }
}

This is just some rough mock-up code, and for the sake of learning this properly let's just say that all Position structs were created randomly all over the heap.

Can modern, smart processors such as Intel's i7 look ahead and see that they're going to need the next pointer's data very shortly? Would lines of code like the following help?

... // for loop
Position& nextPos1 = *posPointers[i];
Position& nextPos2 = *posPointers[i+1];
Position& nextPos3 = *posPointers[i+2];
Position& nextPos4 = *posPointers[i+3];
... // Work on data here

I had read some presentation slides that seemed to indicate code like this would cause the processor to prefetch some data. Is that true? I am aware there are non-standard, platform-specific ways to request prefetching, like __builtin_prefetch, but sprinkling that everywhere seems like an ugly premature optimization. I am looking for a way to subconsciously write cache-efficient code.

Jonathan
  • Code like this is unlikely to perform well; it is very cache-unfriendly and doesn't auto-vectorize. The simple fix is a `std::vector` of values: make a copy. – Hans Passant Mar 02 '13 at 11:51
  • Creating that copy would be just as cache-inefficient. You still need to gather the objects from all across memory. And if the results need to be stored back, making a copy would be even worse. – MSalters Mar 02 '13 at 15:12

2 Answers


I know you didn't ask for (and probably don't need) a sermon on the proper treatment of caches, but I thought I'd contribute my two cents anyway. Note that all of this only applies in hot code. Remember that premature optimization is the root of all evil.

As has been pointed out in the comments, the best way is to have containers of actual data. Generally speaking, flat data structures are much preferable to "pointer spaghetti", even if you have to duplicate some data and/or pay a price for resizing/moving/defragmenting your data structures.
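As a minimal sketch of the flat version (reusing the `Position` struct from the question):

#include <cstdint>
#include <vector>

struct Position {
    int32_t x, y, z;
};

// Contiguous storage: the loop walks memory linearly, so each new
// element is usually already in the cache line fetched for the previous one.
void updatePositions(std::vector<Position>& positions) {
    for (Position& p : positions) {
        p.x++;
        p.y++;
        p.z++;
    }
}

This is the layout the hardware prefetcher handles best, and it's also the form that compilers can auto-vectorize.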

And as you know, flat data structures (e.g. an array of data) only pay off if you access them linearly and sequentially most of the time.

But this strategy may not always be usable. If you can't lay your data out linearly, you can use other strategies, like employing pool allocators and iterating over the pools themselves instead of over the vector holding the pointers (see the sketch below). This of course has its own disadvantages and can be a bit more complicated.
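A minimal sketch of that idea, using a hypothetical, deliberately simplified pool (not a production allocator):

#include <cstddef>
#include <cstdint>
#include <vector>

struct Position { int32_t x, y, z; };

// Hypothetical fixed-chunk pool: hands out stable Position* from
// contiguous chunks, and can iterate over the chunks linearly.
class PositionPool {
    static constexpr std::size_t kChunkSize = 1024;
    std::vector<std::vector<Position>> chunks_;
public:
    Position* allocate() {
        if (chunks_.empty() || chunks_.back().size() == kChunkSize) {
            chunks_.emplace_back();
            chunks_.back().reserve(kChunkSize);  // never reallocates, so pointers stay valid
        }
        chunks_.back().push_back(Position{0, 0, 0});
        return &chunks_.back().back();
    }
    template <typename F>
    void forEach(F&& f) {
        // Linear sweep over the pool's own storage instead of chasing pointers.
        for (auto& chunk : chunks_)
            for (Position& p : chunk)
                f(p);
    }
};

The update then becomes pool.forEach([](Position& p) { p.x++; p.y++; p.z++; }); while other code can still hold stable Position* handles.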

I'm sure you know this already, but it bears mentioning again: one of the most effective techniques for getting the most out of your cache is having smaller data! In the above code, if you can get away with int16_t instead of int32_t, you should definitely do so. You should also pack your many bools and flags and enums into bit-fields, use indexes instead of pointers (especially on 64-bit systems), use fixed-size hash values in your data structures instead of strings, etc.
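As an illustration (assuming 16-bit coordinates suffice for your use case, and that a 32-bit index can stand in for each 64-bit pointer):

#include <cstdint>
#include <vector>

// 6 bytes of payload instead of 12: twice as many positions per cache line.
struct PackedPosition {
    int16_t x, y, z;
};

// 4-byte index into the vector instead of an 8-byte pointer.
using PositionIndex = uint32_t;

void updatePosition(std::vector<PackedPosition>& positions, PositionIndex i) {
    PackedPosition& p = positions[i];
    p.x++;
    p.y++;
    p.z++;
}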

Now, about your main question: can the processor follow random pointers around and bring the data into the cache before it is needed? To a very limited extent, this does happen. As you probably know, modern CPUs employ a lot of tricks to increase their speed (i.e. increase their instruction retire rate): store buffers, out-of-order execution, superscalar pipelines, multiple functional units of every kind, branch prediction, etc. Most of the time, these tricks just help the CPU keep executing instructions, even if the current instructions have stalled or take too long to finish.

For memory loads (the slowest thing to do, if the data is not in the cache) this means that the CPU should get to the instruction as soon as possible, calculate the address, and request the data from the memory controller. However, the memory controller can have only a very limited number of outstanding requests (usually two these days, but I'm not sure). This means that even if the CPU did very sophisticated things to look ahead into other memory locations (e.g. the elements of your posPointers vector) and deduce that these are the addresses of new data that your code is going to need, it couldn't get very far ahead because the memory controller can have only so many requests pending.

In any case, AFAIK, I don't think CPUs actually do that yet. Note that this is a hard case, because the addresses of your randomly distributed memory locations are themselves in memory (as opposed to being in a register, or calculable from the contents of a register). And even if the CPUs did it, it wouldn't have that much of an effect anyway, because of the memory interface limitations.

The prefetching technique you mentioned seems valid to me and I've seen it used, but it only yields a noticeable effect if your CPU has something to do while waiting for the future data to arrive. Incrementing three integers takes a lot less time than loading 12 bytes from memory (loading one cache line, actually) and therefore it won't mean much for the execution time. But if you had something worthwhile and more heavyweight to overlap with the memory prefetches (e.g. calculating a complex function that doesn't require data from memory!) then you could get very nice speedups. You see, the time to go through the above loop is essentially the sum of the time of all the cache misses; the coordinate increments and the loop bookkeeping come for free. So, you'd have won more if the free stuff were more valuable!
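If you do try it, the usual shape looks something like this sketch. It uses the GCC/Clang-specific __builtin_prefetch; the prefetch distance of 8 is a made-up tuning parameter, and heavyWork is a hypothetical stand-in for per-element computation substantial enough to hide the miss latency:

#include <cstddef>
#include <cstdint>
#include <vector>

struct Position { int32_t x, y, z; };

// Stand-in for a computation heavy enough to overlap with the prefetch.
int64_t heavyWork(const Position& p) {
    int64_t acc = 0;
    for (int k = 0; k < 64; ++k)
        acc += int64_t{p.x + k} * (p.y - k) + p.z;
    return acc;
}

int64_t processAll(const std::vector<Position*>& posPointers) {
    constexpr std::size_t kDistance = 8;  // tune per workload and hardware
    int64_t total = 0;
    for (std::size_t i = 0; i < posPointers.size(); ++i) {
        // Request a future element's cache line while working on the current one.
        if (i + kDistance < posPointers.size())
            __builtin_prefetch(posPointers[i + kDistance]);  // GCC/Clang extension
        total += heavyWork(*posPointers[i]);
    }
    return total;
}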

yzt

Modern processors have hardware prefetching mechanisms: Intel Hardware prefetcher. They infer stride access patterns in memory and prefetch memory locations that are likely to be accessed in the near future.

However, in the case of totally random pointer chasing, such techniques cannot help. The processor does not know that the running program is performing pointer chasing, so it cannot prefetch accordingly. In such cases hardware mechanisms can even be detrimental to performance, as they would prefetch values that are unlikely to be used.

The best you can do is to organize your data structures in memory in such a way that accesses to contiguous portions of memory are more likely.
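One cheap trick along these lines, as a sketch (only worthwhile if the pointer array is traversed many times between changes): sort the pointers by address, so the walk sweeps through the heap mostly forward, nearby objects share cache lines, and the hardware prefetcher gets a pattern it can follow.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct Position { int32_t x, y, z; };

// Sort by address; std::less is used because ordering unrelated
// pointers with `<` directly is technically undefined behavior.
void sortByAddress(std::vector<Position*>& posPointers) {
    std::sort(posPointers.begin(), posPointers.end(),
              std::less<Position*>());
}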

igon
  • BTW, the guide that @Pradheep suggested is very good although it does not cover such details. – igon Mar 02 '13 at 05:02