
I'm running the following simple code on a powerful server with several Nvidia RTX A5000/A6000 GPUs and CUDA 11.8. For some reason, the FFT on the GPU is much slower than on the CPU (200-800 times). Does anyone have an idea why that might be? I tried different GPUs, but the results remain approximately the same.

    import sigpy as sp
    import torch
    import time

    # 256x256 Shepp-Logan phantom as the test image
    arr = sp.shepp_logan((256, 256))
    device = "cpu"
    arr = torch.from_numpy(arr).to(device)

    # Time a single 2D FFT on the CPU
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = toc - tic

    # Time a single 2D FFT on the GPU
    device = "cuda:5"
    arr = arr.to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    gpu_time = toc - tic

    print(f"CPU time: {cpu_time}, GPU time: {gpu_time}, ratio: {gpu_time / cpu_time}")

Thanks!

  • Have you tried different sizes? Seen how it scales? – Homer512 Jun 08 '23 at 22:29
  • Yes, I tried increasing the input size significantly, to 32 × 1024 × 1024 (32 is the batch size), and it is still much slower (this time 20 times slower) – MRm Jun 08 '23 at 22:35

1 Answer


Okay, digging a little deeper: this is not the right way to compare compute time; for a fair comparison, we need to average over many runs. After doing that, I see that the GPU version is indeed faster (more noticeably so for larger inputs). So there seems to be some "warm-up" time for the GPU (though I didn't expect such a big difference for a single test point). I'd love to hear if anyone has an explanation for why this is happening! Here is the averaged benchmark:

    import numpy as np
    import time
    import torch

    IM_SIZE = 512
    BATCH_SIZE = 8
    N_TEST = 10000
    RAND = 100

    # Pool of random images; cycling through them avoids measuring a cached result
    arrs = np.random.randn(RAND, IM_SIZE, IM_SIZE)
    arrs = torch.from_numpy(arrs)

    # Average N_TEST FFTs on the CPU
    device = "cpu"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = (toc - tic) / N_TEST

    # Average N_TEST FFTs on the GPU
    device = "cuda:5"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    # CUDA kernels launch asynchronously, so wait for them to finish
    # before reading the clock
    torch.cuda.synchronize(device)
    toc = time.perf_counter()
    gpu_time = (toc - tic) / N_TEST

    print(f"CPU time: {cpu_time * 1000} ms, GPU time: {gpu_time * 1000} ms, ratio: {gpu_time / cpu_time}")
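If you want to see the warm-up cost in isolation, here is a minimal sketch (assuming a single CUDA device at cuda:0; adjust the index and sizes to match your setup) that times the first call separately from the steady state, using torch.cuda.Event so the GPU measurement is properly synchronized:

    import torch

    device = "cuda:0"
    # This first allocation also initializes the CUDA context
    arr = torch.randn(8, 512, 512, device=device)

    def timed_fft(x):
        # Events record timestamps on the GPU's own stream, so the
        # measurement is not distorted by asynchronous kernel launches
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.fft.fft2(x, dim=(-2, -1))
        end.record()
        torch.cuda.synchronize(device)
        return start.elapsed_time(end)  # milliseconds

    # First timed call still pays for cuFFT plan creation for this shape
    first = timed_fft(arr)
    # Subsequent calls reuse the cached plan: this is the steady state
    steady = sum(timed_fft(arr) for _ in range(100)) / 100
    print(f"first call: {first:.3f} ms, steady state: {steady:.3f} ms")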