I am using the FFTW library with OpenMP, and the best speedup I have been able to achieve is only about 2-3x. Even with 6 cores and 6 threads I barely reach ~2.5x. I am simply following the FFTW documentation: HERE

The example code I have:

#include <cstring>  // memcpy
#include <fftw3.h>
#include <omp.h>

static const int nx = 128; 
static const int ny = 128;  
static const int nz = 128;

fftw_complex *input_array;
input_array = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));

memcpy(input_array, Re.data(), (nx*ny*nz) * sizeof(fftw_complex));

fftw_complex *output_array;
output_array = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));

fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(forward);
fftw_destroy_plan(forward);
fftw_cleanup();

memcpy(Im.data(), output_array, (nx*ny*nz) * sizeof(fftw_complex));

fftw_free(input_array);
fftw_free(output_array);

The flags I am using are:

-lfftw3 -lfftw3_threads -Ofast -fopenmp

Is this the best speedup I can achieve using FFTW's built-in OpenMP support? Should I rewrite this to make it faster? Thanks!

Jamie
  • Your example program doesn't even compile. I think it needs a `main()` function. – Toby Speight Aug 04 '23 at 07:48
  • If you want FFTW to use OpenMP, you should link with `-lfftw3_omp` instead of `-lfftw3_threads` (which probably uses POSIX threads), according to the docs you linked. You might want to find out how much of the runtime is actually spent in the parallelized FFT, e.g. with [`omp_get_wtime()`](https://www.openmp.org/spec-html/5.0/openmpsu160.html). I would expect a lot of the runtime to go to planning the FFT (plus allocation and the `memcpy`s). If you use that FFT only once, you won't find much benefit. The assumption is that you reuse it a lot. – paleonix Aug 04 '23 at 08:32
  • Also, big FFTs are generally memory-bandwidth-limited, so if e.g. 3 threads already fully use the memory subsystem, any additional threads will only add overhead. – paleonix Aug 04 '23 at 08:46
  • You could also use e.g. `FFTW_PATIENT` once to get a better plan and then [use wisdom](https://fftw.org/doc/Words-of-Wisdom_002dSaving-Plans.html#Words-of-Wisdom_002dSaving-Plans) to cache it for future runs. – paleonix Aug 04 '23 at 08:53
