8

I am working on a C++ project that needs to perform FFTs on large 2D raster data (10 to 100 GB). In particular, performance is quite bad when applying the FFT to each column, whose elements are not contiguous (they are placed with a stride equal to the width of the data).

Currently, I'm doing the following. Since the data does not fit in memory, I read several columns, say n of them, into memory with the orientation transposed (so that a column in the file becomes a row in memory) and apply the FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek by width - n), read n pixels, jump to the next row, and so on. When the operation (FFT) is done with the column chunk, I write it back to the file in the same manner: write n pixels, jump to the next row, and so on. This way of reading and writing the file takes too much time, so I want to find some way of speeding it up. Roughly, the read side looks like the sketch below.
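Here the pixel type (`float`) and the helper's name are placeholders, and error handling is trimmed:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Read columns [col0, col0 + n) of a width x height raster into buf,
// transposed so that each file column becomes a contiguous row of buf.
bool read_column_chunk(std::FILE* f, std::size_t width, std::size_t height,
                       std::size_t col0, std::size_t n,
                       std::vector<float>& buf) {
    buf.resize(n * height);
    std::vector<float> row(n);
    for (std::size_t y = 0; y < height; ++y) {
        // Seek to this row's slice of the chunk. For files over 2 GB,
        // fseeko/_fseeki64 is needed, since long may be 32-bit.
        long off = static_cast<long>((y * width + col0) * sizeof(float));
        if (std::fseek(f, off, SEEK_SET) != 0) return false;
        if (std::fread(row.data(), sizeof(float), n, f) != n) return false;
        // Transpose on the fly: file pixel (y, col0 + c) goes to buf row c.
        for (std::size_t c = 0; c < n; ++c)
            buf[c * height + y] = row[c];
    }
    return true;
}
```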

I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT passes, so a single up-front transpose would not help.

I'd like to hear any experience or ideas about this kind of column-major operation on large data. Any suggestions related specifically to FFT or MKL would help as well.

rooot
  • Have you tried [memory-mapped files](https://en.wikipedia.org/wiki/Memory-mapped_file)? – Scheff's Cat Aug 08 '18 at 07:57 (a sketch of this approach follows the comments)
  • 100GB data sets are not "big" by contemporary standards. For example, `x1.16xlarge` instances on AWS have 1TB RAM and can be hired for 2 dollars an hour at spot prices. If you only process a couple of data sets a day and can store your data in the cloud (say AWS S3), it may be more worthwhile to just hire the necessary computing power on demand. – oakad Aug 08 '18 at 08:05
  • @oakad: Removing needless slowdowns is especially important when you're paying by the hour. This question is equally relevant for AWS applications. – MSalters Aug 08 '18 at 09:11
  • Are you using reasonable values for `n`? I would expect a value of `4096/sizeof(pixel)` to work reasonably well. Of course, this sort of code should run off a fast SSD; I do agree with oakad's general idea of using fast HW. On an SSD, 4K random reads should be plenty fast. – MSalters Aug 08 '18 at 09:18
  • @MSalters Nope, the question is about FFTing the "slow" stored data which would not directly fit in RAM. My comment is about hiring enough RAM and not bothering with the "slow" storage. – oakad Aug 09 '18 at 01:35
  • Is the array square or rectangular? What dimensions? Is the array already stored as complex-valued or is it real-valued? Are you doing a real-to-complex transform (computing half the coefficients) or a full complex-to-complex (computing both sides of the spectrum)? Are you doing a 2D transform or just a 1D transform in the slow dimension? Do you need all the output coefficients or do you just need some of them? Do you just need the FFT coefficients or are you doing some operation in the Fourier domain and converting the data back to time-domain (IFFT) afterwards? – Ahmed Fasih Aug 10 '18 at 10:32
  • Because each output of the FFT depends on all the inputs, there's no way around having to scan the entire vector (1D FFT) or array (2D FFT) to produce each output element, and that's going to be slow on slow storage (SSD or RAM). There's ways you can cut waste (like real->complex if you can, praying your sizes have small prime factors, etc.) but my other questions are aimed at finding alternative ways of formulating the entire task. E.g., texture shading traditionally uses a 2D FFT, then Fourier scaling, then IFFT, but I found a 2D FIR (spatial-domain) approximation that skips huge FFTs. – Ahmed Fasih Aug 10 '18 at 10:52
  • @oakad Unfortunately, cloud services are not an option, since the system is not connected to the Internet due to security constraints. I'm also trying to get more RAM into the processing server in order to avoid using slow storage. But for now, I need to manage with limited resources. – rooot Sep 03 '18 at 01:29
  • @LeeDaekeun You may actually want to look into really old stuff - your present problem was much more common in the past. Take, for example, this article from NASA: https://www.nas.nasa.gov/assets/pdf/techreports/1989/rnr-89-004.pdf - it appears to have some pretty interesting advice for slow storage FFT. – oakad Sep 03 '18 at 05:41
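As referenced in the first comment, here is a minimal POSIX sketch of the memory-mapped approach (the file name, raster width, and `float` pixel type are assumptions; error handling is trimmed):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const std::size_t width = 65536;  // assumed raster width in pixels
    int fd = open("raster.bin", O_RDWR);
    struct stat st;
    fstat(fd, &st);
    // Map the whole file; the OS pages data in and out on demand, so a
    // strided column walk becomes plain pointer arithmetic.
    float* pixels = static_cast<float*>(mmap(nullptr, st.st_size,
        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    madvise(pixels, st.st_size, MADV_RANDOM);  // hint: scattered access
    // Element y of column x, e.g. y = 100, x = 42:
    float v = pixels[std::size_t{100} * width + 42];
    (void)v;
    munmap(pixels, st.st_size);
    close(fd);
}
```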

2 Answers

0

Why not work with both transposed and non-transposed copies of the data at the same time? That will double the storage requirement, but it may be worth it.
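For example, once a transposed copy exists on disk, reading an n-column chunk of the original becomes a single contiguous read (a rough sketch; the file layout, names, and `float` pixel type are assumptions):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// The transposed file stores the raster column-major: `width` rows of
// `height` pixels each. Columns [col0, col0 + n) of the original are
// then n consecutive rows of this file.
bool read_columns_contiguous(std::FILE* transposed, std::size_t height,
                             std::size_t col0, std::size_t n,
                             std::vector<float>& buf) {
    buf.resize(n * height);
    long off = static_cast<long>(col0 * height * sizeof(float));
    if (std::fseek(transposed, off, SEEK_SET) != 0)  // fseeko for > 2 GB
        return false;
    return std::fread(buf.data(), sizeof(float), n * height, transposed)
           == n * height;
}
```

The cost is keeping the two copies in sync: after a pass that modifies the data, the other copy has to be updated (e.g., with an out-of-core block transpose).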

Anton Malyshev
0

Consider switching to a Hadamard transform. As a complete IPS, the transform requires no multiplications, since all of its coefficients are plus or minus one. If you need the result in a Fourier basis, a matrix multiplication will change bases.
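For reference, a minimal in-place sketch of the (unnormalized) fast Walsh–Hadamard transform; the function name is illustrative and the length must be a power of two:

```cpp
#include <cstddef>

// In-place fast Walsh-Hadamard transform over n elements (n a power of
// two). Each butterfly is one addition and one subtraction; there are
// no multiplications anywhere.
void fwht(float* a, std::size_t n) {
    for (std::size_t len = 1; len < n; len <<= 1)
        for (std::size_t i = 0; i < n; i += len << 1)
            for (std::size_t j = i; j < i + len; ++j) {
                float u = a[j], v = a[j + len];
                a[j] = u + v;
                a[j + len] = u - v;
            }
}
```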

wGraves