
I'm using the asm `dcbt` instruction to touch a range of memory that I know will be required for certain computations. My profiler shows a pattern of cache misses caused by sporadic access to elements inside this range (4 touched, 5 skipped, and so on, producing a cache miss on every 5th operation).

There is a function A() that knows the exact range and its size. A() is called before another function, B(), that also touches and uses data from that range. Can I just use `dcbt` inside A() and then expect an improvement in B(), or do I have to use `dcbt` on the range in the same function that uses that data?
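To make the setup concrete, here is a minimal sketch of what I mean. A(), B(), the stride, and the line size are all illustrative; I'm using GCC/Clang's `__builtin_prefetch`, which on PowerPC typically lowers to `dcbt`:

```c
#include <stddef.h>

#define CACHE_LINE 128  /* typical PowerPC line size; check your CPU */

/* A() knows the range and prefetches it, one touch per cache line. */
static void A(const char *range, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        __builtin_prefetch(range + off, 0 /* read */, 3 /* keep in cache */);
}

/* B() later consumes the range with the sporadic pattern from the
 * question: use 4 elements, skip 5, and so on. */
static long B(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i + 9 <= n; i += 9)
        for (size_t j = 0; j < 4; j++)
            sum += data[i + j];
    return sum;
}
```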

teodron
  • I'm not familiar with the PowerPC architecture, but I would imagine that the contents of the cache are not affected by calling/returning from functions. – Oliver Charlesworth Jun 14 '13 at 14:41
  • Probably it depends on what the functions are doing, because the processor may take things into its own "hands" and fetch what it thinks it will touch next. – teodron Jun 14 '13 at 14:45
  • Sure, if the functions are doing non-trivial memory accesses, then this may well cause stuff to be evicted from cache. But I'd be very surprised if a simple `call` or `ret` (or whatever the equivalents are for PPC) altered things. But, as stated, I'm just speculating... – Oliver Charlesworth Jun 14 '13 at 14:47

1 Answer


Assuming ALL the data used in A() fits in the cache, you should see an improvement in B() too. However, if your access pattern is as sporadic as you say, you can also end up reading data into the cache that is never used, which serves no purpose and just keeps the memory bus busy when it could be loading data that IS needed. By all means give it a try, but don't expect it to magically work effectively. It often takes a bit of "tuning", particularly with regard to how far ahead of where you are right now you read the data.
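The "how far ahead" knob can be sketched like this: rather than touching the whole range up front, prefetch a fixed number of cache lines ahead of the current position inside the consuming loop, and tune that distance empirically. `AHEAD` and the line size are assumptions, not recommendations:

```c
#include <stddef.h>

#define CACHE_LINE 128  /* assumed line size */
#define AHEAD 8         /* lines to run ahead of the loop; tune this */

static long consume(const int *data, size_t n)
{
    long sum = 0;
    const size_t stride = CACHE_LINE / sizeof *data; /* ints per line */
    for (size_t i = 0; i < n; i++) {
        /* issue one prefetch per cache line, AHEAD lines in front */
        if (i % stride == 0 && i + AHEAD * stride < n)
            __builtin_prefetch(&data[i + AHEAD * stride], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

If `AHEAD` is too small the data arrives late; too large and it may be evicted again before use, which is why this usually needs measurement rather than guesswork.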

Depending on the exact behaviour of A() and B(), other techniques may help too. For example, if you are switching between reads and writes, reading from one section and writing to a completely different one, then batching up the writes into a "holding area" that is then copied to RAM is often a good plan. Make the holding area something like 1/8 to 1/4 of the L1 cache.
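A rough sketch of that holding-area idea, with all sizes and the "computation" purely illustrative: writes accumulate in a small cache-resident buffer, which is then flushed to the destination in one sequential pass instead of scattering writes across memory:

```c
#include <stddef.h>
#include <string.h>

#define HOLD 4096  /* e.g. roughly 1/8 of a 32 KiB L1 data cache */

static void batched_transform(int *dst, const int *src, size_t n)
{
    int hold[HOLD / sizeof(int)];           /* cache-resident holding area */
    const size_t batch = sizeof hold / sizeof hold[0];

    for (size_t base = 0; base < n; base += batch) {
        size_t len = (n - base < batch) ? n - base : batch;
        for (size_t i = 0; i < len; i++)
            hold[i] = src[base + i] * 2;    /* stand-in for real work */
        /* one sequential burst of writes back to RAM */
        memcpy(dst + base, hold, len * sizeof(int));
    }
}
```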

[Caveat: I have no experience at all with the PowerPC architecture, but I have used cache prefetching and other memory optimisation techniques in my work with x86 processors, with some success at times and less at others.]

Mats Petersson
  • Thanks for the last hint of holding the write zone into L1 and then writing it back to RAM. Indeed, my case is such that I'm seeing improvements in `B()` from touching ranges in `A()` (20% gain). So doing like you said might make things even better (50% being my goal). – teodron Jun 14 '13 at 15:04