
Assume I have to write a computationally intensive C or C++ function that takes 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.

I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.

Update1(output[]); // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...

I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?

I am thinking of 3 possible solutions:

  • Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I need to be very careful about when to introduce the prefetches;
  • Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
  • Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
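
The first two options combine naturally. A minimal sketch (the computation, the CHUNK size, and the prefetch placement are all hypothetical tuning choices, not a recommendation):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sketch: process output[] in chunks, and hint the lines of the
 * current output chunk back into cache before each update phase.
 * CHUNK is an assumed tuning parameter; 16 floats = one 64B line. */
void update_with_prefetch(float *output, const float *input1,
                          const float *input2, size_t n)
{
    enum { CHUNK = 1024 };
    for (size_t i = 0; i < n; i += CHUNK) {
        size_t end = i + CHUNK < n ? i + CHUNK : n;

        /* re-prefetch this output block in case it was evicted */
        for (size_t j = i; j < end; j += 16)
            _mm_prefetch((const char *)&output[j], _MM_HINT_T0);

        for (size_t j = i; j < end; ++j)
            output[j] += input1[j] * input2[j]; /* stand-in computation */
    }
}
```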

Other than that, is there any elegant solution?

VAndrei
    An occasional prefetch instruction has produced measurable benefits when I've been in similar situations. The trick is to be smart about *when*. – Drew Dormann Apr 26 '15 at 18:40
  • It generally does, especially when the line has been evicted but if I do a prefetch on a line that's already in the cache, does it update the reuse distance in the cache replacement algorithm? – VAndrei Apr 26 '15 at 18:44
  • 1
    Are you sure this will *actually* improve performance? I'd suggest going the opposite way: Uncached/non-temporal/write-combining stores. – EOF Apr 26 '15 at 19:20
  • Eviction control might be all you need. If `input1` isn't in cache, then `input2` should fit without having to displace `output`. – Ben Voigt Jun 08 '15 at 06:10

4 Answers


Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.

While you are attempting to optimise the second (unnecessary) fetch of output[], have you considered using SSE2/3/4 registers to store your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?

I have done something similar while computing FFTs (Fast Fourier Transforms), where part of the output is kept in registers and moved out (to memory) only when it is known it will not be accessed anymore. Until then, all updates happen to the registers. You'll need intrinsics or inline assembly to use the SSE* registers effectively. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
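
With intrinsics the idea might look roughly like this (the accumulate pattern is an assumption made up for illustration; the point is that the partial result lives in an XMM register across both input passes and is stored only once):

```c
#include <stddef.h>
#include <immintrin.h>

/* Sketch of the register-accumulation idea: keep four partial sums of
 * output[] in an XMM register while both inputs are consumed, and write
 * to output[] memory only once, at the end. Assumes n is handled in
 * groups of 4 floats (a remainder loop is omitted for brevity). */
void accumulate_in_registers(float *output, const float *input1,
                             const float *input2, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 acc = _mm_setzero_ps();           /* lives in a register */
        acc = _mm_add_ps(acc, _mm_loadu_ps(&input1[i]));
        acc = _mm_add_ps(acc, _mm_loadu_ps(&input2[i]));
        /* ...further updates touch acc, not output[]... */
        _mm_storeu_ps(&output[i], acc);          /* single write-back */
    }
}
```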

pavan

I am trying to get a better understanding of the question:

If it is true that the 'output' array is strictly for output, and you never do something like

output[i] = Foo(newVal, output[i]);

then all elements of output[] are strictly written, never read. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?

In this scenario, all writes to 'output' generate cache fills and could compete with the cache lines needed for the 'input' arrays.

Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?

KalyanS
  • You'd only need one cacheline if you had a fully-associative cache, but nobody has fully-associative caches. – EOF Apr 26 '15 at 20:57
  • The problem is that I need to use output[] to store intermediary results of the computation. Also, in order to write to output[], the CPU core first reads / fetches the cache line. I just want to avoid that fetching. – VAndrei Apr 27 '15 at 11:53
  • I think the only way to avoid the read-for-ownership that happens before a write is to use non-temporal stores. `movntdqa` and so on. This won't give you the desired result, because the write will bypass the cache (just using a store-buffer to combine multiple 16 or 32B writes into a single 64B transfer). Since you do need to read `output[]` again after writing temporary results, this isn't viable. – Peter Cordes Jul 05 '15 at 23:01

I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:

  1. If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.

  2. You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific cache way for NTA data, i.e., with an 8-way cache, 1/8th per thread.
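
Both options together might look like this (a sketch, assuming x86 SSE, a 16-byte-aligned output buffer, n a multiple of 4, and an illustrative scale-by-2 computation; the prefetch distance of 64 elements is an assumed tuning value):

```c
#include <stddef.h>
#include <immintrin.h>

/* Sketch: stream output[] past the cache with _mm_stream_ps (option 1),
 * and pull input[] through with an NTA prefetch hint (option 2). */
void copy_scaled(float *output, const float *input, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        /* NTA hint: ask for input ahead of use, marked non-temporal */
        _mm_prefetch((const char *)&input[i + 64], _MM_HINT_NTA);
        __m128 v = _mm_mul_ps(_mm_loadu_ps(&input[i]),
                              _mm_set1_ps(2.0f));
        _mm_stream_ps(&output[i], v);  /* non-temporal store: no RFO,
                                          bypasses the cache */
    }
    _mm_sfence();  /* order the streamed stores before later reads */
}
```

Note that because the streamed lines are not cached, this only pays off when output[] really is write-only in this pass.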

Simon

I guess the solution to this is hidden in the algorithm employed, the L1 cache size, and the cache line size, though I am not sure how much performance improvement we will see with this.

We could introduce artificial reads that cleverly dodge the compiler and, during execution, do not hurt the computation either. A single artificial read should fill as many cache lines as are needed to accommodate one block. For this, the algorithm should be modified to compute the output array in blocks, something like the blocked matrix multiplication of huge matrices done on GPUs: blocks of the matrices are used for the computation and for writing the result.

As pointed out earlier, the writes to the output array should happen as a stream.

To bring in the artificial reads, we should initialize the output array ahead of time at the right places, once in each block, probably with 0 or 1.
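
The blocking idea can be sketched as follows (the two update passes and the BLOCK size are assumptions for illustration; BLOCK should be chosen so a block of output plus the matching input slices fit in L1 together):

```c
#include <stddef.h>

enum { BLOCK = 2048 };  /* assumed tuning parameter */

/* Sketch: both update passes touch only one block of output[] before
 * moving on, so the partial results written by the first pass are
 * still resident in cache when the second pass reads them back. */
void compute_blocked(float *output, const float *input1,
                     const float *input2, size_t n)
{
    for (size_t b = 0; b < n; b += BLOCK) {
        size_t end = b + BLOCK < n ? b + BLOCK : n;
        for (size_t i = b; i < end; ++i)
            output[i] = input1[i] + input2[i];  /* Update1 */
        for (size_t i = b; i < end; ++i)
            output[i] *= input2[i];             /* Update2 */
    }
}
```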

ultimate cause