I am using the FFTW library together with OpenMP, and the best speedup I have been able to achieve is only about 2-3x. Even with 6 cores and 6 threads I barely reach about 2.5x. I am simply following the FFTW documentation: HERE
The example code I have:
#include <cstring>   // memcpy
#include <fftw3.h>
#include <omp.h>

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

// Re and Im are defined elsewhere (not shown).
// Allocate the input and output buffers with fftw_malloc.
fftw_complex *input_array = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));
memcpy(input_array, Re.data(), (nx*ny*nz) * sizeof(fftw_complex));
fftw_complex *output_array = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));

// Initialize FFTW's thread support and request one thread per available OpenMP thread.
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());

// Plan and run the forward 3D transform.
fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(forward);

// Release the plan, copy the result out, and free the buffers.
fftw_destroy_plan(forward);
fftw_cleanup();
memcpy(Im.data(), output_array, (nx*ny*nz) * sizeof(fftw_complex));
fftw_free(input_array);
fftw_free(output_array);
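For context on how such a speedup could be measured, here is a minimal sketch of a timing helper. It assumes that only the transform itself should be timed, not plan creation or the memcpy calls. The name best_time_seconds is hypothetical and is not part of FFTW or OpenMP.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: run `work` `reps` times and return the fastest
// wall-clock time in seconds. Taking the minimum reduces noise from
// other processes and from first-call warm-up.
template <typename F>
double best_time_seconds(F&& work, int reps) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

With the plan already created, a call such as best_time_seconds([&] { fftw_execute(forward); }, 10) would time only the execution step. Repeating the measurement for each thread count gives the speedup as the single-thread time divided by the multi-thread time.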
The flags I am using are:
-lfftw3 -lfftw3_threads -Ofast -fopenmp
Is this the best speedup I can achieve using FFTW's built-in OpenMP support? Should I rewrite this to make it faster? Thanks!