I am currently trying to improve the performance of my multithreaded FFTW implementation. In the documentation of fftw3 I read that for best-possible performance, the fftw_malloc function should be used to allocate in- and output data of the DFT.

Since I am dealing with large 3D arrays of size 256*256*256, I have to create them on the heap with

const unsigned int RES = 256;

std::complex<double>(*V)[RES][RES];
V = new std::complex<double>[RES][RES][RES];

And after initialization, I create multithreaded (in-place) fftw_plans for the 3D DFT transforms according to

int N_Threads = omp_get_max_threads();
fftw_init_threads();
fftw_plan_with_nthreads(N_Threads);

fftw_complex *input_V = reinterpret_cast<fftw_complex*>(V);
fftw_plan FORWARD_V = fftw_plan_dft_3d(RES, RES, RES, input_V, input_V, FFTW_FORWARD, FFTW_MEASURE);
fftw_plan BACKWARD_V = fftw_plan_dft_3d(RES, RES, RES, input_V, input_V, FFTW_BACKWARD, FFTW_MEASURE);

My question now is: How do I create these plans using fftw_malloc instead?

In the fftw3 documentation I can only find

fftw_complex *in;
in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);

which I understand as a 1D example. Do I have to flatten my 3D array into a 1D buffer, or is the use of fftw_malloc not possible/advisable in this case?

Azure27
  • I know it has been a while since you posted this, but I am wondering whether you get a good speedup with the code above. Does it scale well? I have exactly the same setup, but OpenMP multithreading is not giving me a good speedup. – Jamie Aug 13 '23 at 20:41
  • Do you mean the speedup I get by using the allocation or the parallelization/FFTW itself? The former is zero since I seemingly allocated them correctly before anyway. The parallelization itself greatly improved my computation times, but I had to tweak around with compiler optimization flags and adjusting the number of FFTW/OMP threads for this. It seems for example that FFTW scales best if I use N_Threads = 2^n <= 64, depending on the machine – Azure27 Aug 15 '23 at 15:54
  • I meant the speedup from the parallelization/FFTW itself. What kind of optimization flags did you use to improve it? I have similar code that only gets about a 2x speedup regardless of the number of threads, and I was wondering whether that's the best-case scenario: https://stackoverflow.com/questions/76833427/openmp-how-can-i-increase-the-speedup-with-number-of-threads – Jamie Aug 15 '23 at 18:07
  • It should definitely not be the same regardless of the number of threads. What exactly are you measuring in your performance tests? Is it the time spent calling fftw_execute() or in the entire script/part of it? As others have mentioned, planning the FFT takes a long time, which is why you only do it once and then call the FFT many times after that. So what you want to compare is how long fftw_execute() takes, e.g. to be called 1000 times with different N_Threads. Also you should use `FFTW_MEASURE` as a planner flag for best performance. My compiler flags are `-lfftw3_omp -lfftw3 -Ofast` – Azure27 Aug 15 '23 at 20:32

1 Answer

malloc and its cousins (like your fftw_malloc) allocate one-dimensional buffers, so in your case what you want is to create a single contiguous buffer large enough to hold your three-dimensional data:

fftw_complex *input_V = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * RES * RES * RES);
// release the buffer later with fftw_free(input_V)
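If you would rather keep the V[i][j][k] syntax from the question, one option is to view the flat buffer through the same pointer-to-array type. The following is only a sketch under assumptions: Grid3D, allocate_grid, free_grid and flat_view are made-up names, std::malloc/std::free stand in for fftw_malloc/fftw_free so the snippet builds without FFTW, and RES is shrunk to 8 so it runs quickly.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdlib>

// Small size so the sketch runs quickly; the question uses 256.
constexpr std::size_t RES = 8;

using Complex = std::complex<double>;
using Grid3D = Complex (*)[RES][RES];  // same pointer type as V in the question

// Stand-in allocator (assumption): in real code, call fftw_malloc here instead.
Grid3D allocate_grid() {
    void* raw = std::malloc(sizeof(Complex) * RES * RES * RES);
    return static_cast<Grid3D>(raw);  // nullptr if the allocation failed
}

// Stand-in deallocator (assumption): pair fftw_malloc with fftw_free instead.
void free_grid(Grid3D g) { std::free(g); }

// The same memory seen as one contiguous run of RES*RES*RES values,
// i.e. the pointer you would reinterpret_cast to fftw_complex*.
Complex* flat_view(Grid3D g) {
    assert(g != nullptr);
    return &g[0][0][0];
}
```

With this, V[i][j][k] and flat_view(V)[(i * RES + j) * RES + k] refer to the same element, so existing indexing code can stay as it is.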

"I read that for best-possible performance, the fftw_malloc function should be used"

It's important to ask "why" whenever you see a statement like that. Specifically, FFTW's fastest code paths use SIMD instructions (SSE, AVX), which need the data aligned to a wider boundary than a plain malloc or new is guaranteed to provide, so this malloc variant allocates suitably aligned memory. It's not magic, and you can definitely do that yourself as well, for example using aligned_alloc.
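As a rough illustration of doing the alignment yourself, here is a minimal sketch using std::aligned_alloc from C++17. The 64-byte alignment is my assumption (a generous value that covers common SIMD widths), not a number taken from the FFTW documentation, and kAlignment, round_up, alloc_aligned and is_aligned are made-up names. Note that std::aligned_alloc is not available on every toolchain.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumed alignment in bytes; chosen generously, not an FFTW-documented value.
constexpr std::size_t kAlignment = 64;

// Round n up to the next multiple of kAlignment, because aligned_alloc
// expects the requested size to be a multiple of the alignment.
constexpr std::size_t round_up(std::size_t n) {
    return (n + kAlignment - 1) / kAlignment * kAlignment;
}

// Allocate room for `count` complex values on a kAlignment boundary.
// Returns nullptr on failure; release the memory with std::free.
std::complex<double>* alloc_aligned(std::size_t count) {
    assert(count > 0);
    void* raw = std::aligned_alloc(
        kAlignment, round_up(count * sizeof(std::complex<double>)));
    return static_cast<std::complex<double>*>(raw);
}

// True if p sits on a kAlignment-byte boundary.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}
```

In practice fftw_malloc is still the simpler choice, since it picks whatever alignment the installed FFTW build needs.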

Blindy
  • Thanks a lot for your answer. Just to make sure I am understanding this correctly: With the `fftw_malloc` command I would allocate the memory but would still have to initialize it. So to create the plan, I should also change my `V[RES][RES][RES]` array into a single-dimensional [row-major format](https://www.fftw.org/fftw3_doc/Row_002dmajor-Format.html) like that of `fftw_malloc`? – Azure27 Nov 24 '21 at 21:23
  • Yeah, you'll essentially be doing the multi-dimensional indexing by hand, or with a helper function. – Blindy Nov 25 '21 at 21:45
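For reference, the helper function mentioned in the last comment could look roughly like the sketch below. It follows the row-major convention from the linked FFTW page, where the last index varies fastest. flat_index and CubeView are hypothetical names, and the view does not own the memory, so it can wrap a pointer obtained from fftw_malloc.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Row-major offset of (i, j, k) in an n0 x n1 x n2 array:
// the last index k varies fastest, the first index i slowest.
constexpr std::size_t flat_index(std::size_t i, std::size_t j, std::size_t k,
                                 std::size_t n1, std::size_t n2) {
    return (i * n1 + j) * n2 + k;
}

// Hypothetical non-owning view over a flat n*n*n buffer.
class CubeView {
public:
    CubeView(std::complex<double>* data, std::size_t n) : data_(data), n_(n) {}

    // Element access with (i, j, k) instead of hand-written offsets.
    std::complex<double>& operator()(std::size_t i, std::size_t j, std::size_t k) {
        assert(i < n_ && j < n_ && k < n_);
        return data_[flat_index(i, j, k, n_, n_)];
    }

private:
    std::complex<double>* data_;  // not owned; free it wherever it was allocated
    std::size_t n_;
};
```

Usage would then be along the lines of CubeView V(buffer, RES); V(i, j, k) = value; while the plan is still created from the raw buffer pointer.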