
I have searched for an answer to this question but have not found anything that can directly help me.

I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.

My arrays consist of 2^30 data points, which is much, much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead when accessing memory.

Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?

Any help would be much appreciated.

Thanks,

Dylan

3 Answers


2^30 data points in a single FFT counts as being quite big!

The data plus the exponentials and the output array is several thousand times bigger than the L3 cache, and millions of times bigger than L1.

Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.

I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.

You might try contacting Mercury Systems Inc. (www.mrcy.com) and asking them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real, Intel-optimised SAL is definitely not free, though.

Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.

GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (2^30 single-precision complex values at 8 bytes each is 8 Gbytes, the same again for the output array, plus exponentials, etc.).

bazza
  • Thanks very much for the response, I'll give SAL a look and see how it goes – Dylan Brown Jun 10 '15 at 21:29
  • @DylanBrown I have done some more digging; I don't think they've got round to doing all their own FFT variants on Intel yet. I strongly suggest you ask them for some benchmark timings before getting too deeply into it; it's not free, so you'd have to be sure before committing. – bazza Jun 11 '15 at 02:18
0

I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.

It does not matter that your arrays are very large; 50% misses are excessive. The processor should avoid misses by detecting that you are iterating over an array and loading, ahead of time, the data elements you are likely to use.

Either the pattern of array accesses is not regular, so the prefetcher in the processor cannot figure out a pattern to prefetch, or you have a cache associativity problem; that is, elements in your iteration might be mapped to the same cache slots.

For example, assume a cache size of 1 MB and a set associativity of 4. In this example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, the lower 20 bits are always the same, so every access goes to the same cache slot and each new element shares that slot with the previous ones. When you get to the fifth element, all four positions are used up, and from then on it is only misses; in that case your cache is effectively a single slot. If instead you stride by half the cache size, the effective number of slots is 2, which might be enough to avoid misses entirely, or to give 100% misses, or anything in between, depending on whether your access pattern requires both slots simultaneously.

To convince yourself of this, write a toy program with varying stride sizes; you'll see that strides which divide, or are multiples of, the cache size increase misses. You can measure the misses with valgrind --tool=cachegrind.
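
A minimal sketch of such a toy program in C follows; the array size and the example strides are arbitrary illustrations, so pick values around your own machine's cache sizes:

    /* stride.c -- walk a large array at a configurable stride and watch
     * the cache miss rate change. The array size and example strides are
     * arbitrary choices; adjust them to your own cache sizes.
     *
     * Build and run, e.g.:
     *   gcc -O2 stride.c -o stride
     *   valgrind --tool=cachegrind ./stride 1        (sequential: few misses)
     *   valgrind --tool=cachegrind ./stride 131072   (1 MB stride in doubles)
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)  /* 16M doubles = 128 MB, far bigger than any cache */

    int main(int argc, char **argv)
    {
        size_t stride = (argc > 1) ? (size_t)atoll(argv[1]) : 1;
        double *a = calloc(N, sizeof *a);
        if (!a || stride == 0)
            return 1;

        /* Touch every element exactly once, but in stride-sized jumps:
         * stride 1 lets the prefetcher stream the data, while large
         * power-of-two strides keep hitting the same cache sets. */
        double sum = 0.0;
        for (size_t off = 0; off < stride; ++off)
            for (size_t i = off; i < N; i += stride)
                sum += a[i];

        printf("sum = %f\n", sum);  /* stop the loop being optimised away */
        free(a);
        return 0;
    }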

TheCppZoo
  • Thanks very much for your response. I will try to set up a sample and have a look – Dylan Brown Jun 09 '15 at 02:53
  • This answer appears to misunderstand what a Fourier Transform is. Each output value depends on every input value; that is to say, you have 2^30 outputs depending on 2^30 inputs. Sure, you can access the input "regularly" (contiguously) for each output value. That's the O(N*N) non-fast Fourier transform. The FFT gives up "regular" access in exchange for much, much higher numerical performance. N is 2^30, log N is 30 here. – MSalters Jun 09 '15 at 07:59

You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.

The rest of this post assumes that the FFT is really at fault and we need to optimize.

The standard trick to get data locality out of an FFT is to:

  • Arrange the data in a two-dimensional array
  • Do an FFT along each row
  • Apply twiddle factors
  • Do a matrix transpose
  • Do an FFT along each row

This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.

The point of this is that the recursive calls to the FFT are much, much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.

Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
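
To make the recipe concrete, here is a minimal self-contained sketch in C. It uses a naive O(n^2) DFT for the small row transforms (a real implementation would call an optimised library FFT for those) and a tiny N, so the result can be checked against a direct DFT; the index mapping is the standard Cooley-Tukey one, and the output comes out in transposed order:

    /* fourstep.c -- sketch of the four-step factorisation for N = N1 * N2.
     * Input element (r, c) of the N1 x N2 matrix is x[r + N1*c]; after the
     * final row FFTs, element (c, r) of the N2 x N1 result is X[N2*r + c].
     */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Naive O(cols^2) DFT of each row; stands in for a library FFT. */
    static void dft_rows(complex double *a, int rows, int cols)
    {
        complex double *tmp = malloc(cols * sizeof *tmp);
        for (int r = 0; r < rows; ++r) {
            for (int k = 0; k < cols; ++k) {
                complex double s = 0;
                for (int n = 0; n < cols; ++n)
                    s += a[r * cols + n] * cexp(-2 * M_PI * I * n * k / cols);
                tmp[k] = s;
            }
            for (int k = 0; k < cols; ++k)
                a[r * cols + k] = tmp[k];
        }
        free(tmp);
    }

    static void transpose(const complex double *src, complex double *dst,
                          int rows, int cols)
    {
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                dst[c * rows + r] = src[r * cols + c];
    }

    int main(void)
    {
        enum { N1 = 4, N2 = 8, N = N1 * N2 };
        complex double x[N], a[N], b[N];

        for (int n = 0; n < N; ++n)                 /* arbitrary test data */
            x[n] = cos(0.3 * n) + 0.1 * I * n;

        /* Arrange as a 2D array: row r is the stride-N1 subsequence. */
        for (int r = 0; r < N1; ++r)
            for (int c = 0; c < N2; ++c)
                a[r * N2 + c] = x[r + N1 * c];

        dft_rows(a, N1, N2);                        /* FFT along each row  */

        for (int r = 0; r < N1; ++r)                /* twiddle factors     */
            for (int c = 0; c < N2; ++c)
                a[r * N2 + c] *= cexp(-2 * M_PI * I * r * c / N);

        transpose(a, b, N1, N2);                    /* matrix transpose    */
        dft_rows(b, N2, N1);                        /* FFT along each row  */

        /* Check against a direct DFT: b[c*N1 + r] should equal X[N2*r + c]. */
        double err = 0;
        for (int k = 0; k < N; ++k) {
            complex double s = 0;
            for (int n = 0; n < N; ++n)
                s += x[n] * cexp(-2 * M_PI * I * n * k / N);
            err += cabs(s - b[(k % N2) * N1 + (k / N2)]);
        }
        printf("total error vs direct DFT: %g\n", err);
        return 0;
    }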

The library you're using really should be doing this already. If it's not, then some options are:

  • Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey in an efficient way, even though the high-level routines don't do so
  • You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs (the batched row-FFT step is sketched below).
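
For the second option, here is a sketch of how the batched row-FFT pass might be set up with MKL's DFTI interface. The sizes are illustrative, most error checking is omitted, and the twiddle and transpose steps from the list above still have to be supplied:

    /* Sketch: one DFTI descriptor performs n1 row FFTs of length n2 in a
     * single call -- the building block for a hand-rolled four-step pass.
     * Sizes are illustrative (N = 2^20 here); for 2^30 points you might
     * use n1 = n2 = 32768. Error checking is abbreviated.
     */
    #include <mkl.h>

    int main(void)
    {
        const MKL_LONG n1 = 1024, n2 = 1024;  /* matrix: n1 rows x n2 cols */
        MKL_Complex16 *data = mkl_malloc((size_t)(n1 * n2) * sizeof *data, 64);
        if (!data)
            return 1;
        /* ... fill `data` so that row r holds the appropriate strided
         *     subsequence of the 1D input ... */

        DFTI_DESCRIPTOR_HANDLE h;
        MKL_LONG status;
        status = DftiCreateDescriptor(&h, DFTI_DOUBLE, DFTI_COMPLEX, 1, n2);
        status = DftiSetValue(h, DFTI_NUMBER_OF_TRANSFORMS, n1);
        status = DftiSetValue(h, DFTI_INPUT_DISTANCE, n2); /* rows contiguous */
        status = DftiSetValue(h, DFTI_OUTPUT_DISTANCE, n2);
        status = DftiCommitDescriptor(h);
        status = DftiComputeForward(h, data);              /* in-place FFTs  */
        if (status != DFTI_NO_ERROR)
            return 1;

        /* Next: multiply in the twiddle factors, transpose, and run a
         * second batched pass with n1 and n2 swapped, as in the list above. */
        DftiFreeDescriptor(&h);
        mkl_free(data);
        return 0;
    }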
  • Of course, this does concentrate the cache misses in the matrix transpose step. – MSalters Jun 09 '15 at 07:53
  • @MSalters: ... which gets mitigated by using a good transpose algorithm. –  Jun 09 '15 at 08:36
  • @Hurkyl, Intel have missed a trick. The later Xeons have something called QuickData, basically an internal DMA engine. Unfortunately it's quite simple, it just moves data around. However DMA engines on some platforms can be programmed to do matrix transpositions whilst they're doing the DMA transfer itself (a transpose is just data shuffling after all). It's a nice trick getting a DMA engine doing your matrix transpose whilst the CPU gets on with doing some maths. – bazza Jun 09 '15 at 08:54