
I am running a UNet with PyTorch on medical imaging data, with a number of transformations and augmentations in my preprocessing. However, after digging into the different preprocessing packages like TorchIO and MONAI, I noticed that most of the functions, even when they accept tensors as input and output, actually run on the CPU: they either take numpy arrays directly or call .numpy() on the tensors.

The problem is that my data consists of 3D images of dimension 91x109x91 that I resize to 96x128x96, so they are pretty big. Hence, running the transformations and augmentations on the CPU seems pretty inefficient.

First, it makes my program CPU-bound: it takes more time to transform my images than to run them through the model (I timed it several times). Secondly, I checked the GPU usage and it oscillates between roughly 0% and 100% at each batch, so it's clearly limited by the CPU. I would like to speed this up if possible.

My question is: why are these packages not using the GPU? They could at least provide hybrid functions taking either a numpy array or a tensor as input, since many numpy functions are available in Torch as well. Is there a good reason to stick to the CPU rather than speed up the preprocessing by loading the images onto the GPU at the start of the preprocessing?

I translated a simple normalization function to work on the GPU and compared the running time between the GPU and CPU versions; even on a laptop (NVidia M2000M) the function was 3 to 4 times faster on the GPU.
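
Roughly what I did (a simplified sketch, not my exact code; it assumes a CUDA device is available):

```python
import time
import torch

def zscore(t: torch.Tensor) -> torch.Tensor:
    # Plain z-score normalization; runs on whatever device the tensor lives on.
    return (t - t.mean()) / t.std()

x_cpu = torch.rand(96, 128, 96)
x_gpu = x_cpu.to("cuda")  # requires a CUDA device

start = time.perf_counter()
for _ in range(100):
    zscore(x_cpu)
print("CPU:", time.perf_counter() - start)

torch.cuda.synchronize()            # don't let async kernel launches skew the timing
start = time.perf_counter()
for _ in range(100):
    zscore(x_gpu)
torch.cuda.synchronize()
print("GPU:", time.perf_counter() - start)
```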

On an ML Discord, someone mentioned that GPU-based functions might not give deterministic results and that this might be why it's not a good idea, but I don't know whether that's actually the case.

My preprocessing includes resizing, intensity clamping, z-scoring, intensity rescaling, and then I have some augmentations like random histogram shift/elastic transform/affine transform/bias field.
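
To illustrate, the deterministic steps at least seem straightforward to express with plain torch ops that run on whichever device the tensor lives on (a rough sketch of what I mean, not production code):

```python
import torch
import torch.nn.functional as F

def preprocess(vol: torch.Tensor) -> torch.Tensor:
    # vol: (D, H, W) volume, e.g. 91x109x91, already on the target device
    vol = vol[None, None]                                  # -> (1, 1, D, H, W)
    vol = F.interpolate(vol, size=(96, 128, 96),
                        mode="trilinear", align_corners=False)  # resize
    lo, hi = vol.quantile(0.01).item(), vol.quantile(0.99).item()
    vol = vol.clamp(lo, hi)                                # intensity clamping
    vol = (vol - vol.mean()) / vol.std()                   # z-scoring
    vol = (vol - vol.min()) / (vol.max() - vol.min())      # rescale to [0, 1]
    return vol[0, 0]

out = preprocess(torch.rand(91, 109, 91, device="cuda"))
```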

1 Answer

A transformation will typically only be faster on the GPU than on the CPU if the implementation can make use of the parallelism the GPU offers. Anything that operates element-wise or row/column-wise can usually be made faster on the GPU, and that covers most image transformations.

The reason some libraries don't implement things on the GPU is that it requires additional work for each tensor library you want to support (PyTorch, TensorFlow, MXNet, ...), and you still have to maintain a CPU implementation anyway. Since you're using PyTorch, check out the torchvision package, which implements many transformations for both GPU and CPU tensors.
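
For example, something along these lines keeps the whole chain on the GPU (a minimal 2D sketch with a recent torchvision; note that torchvision's transforms are image-oriented, so your 3D volumes would need to be handled slice-wise or with another library):

```python
import torch
import torchvision.transforms as T

# The same transform objects work on CPU and CUDA tensors of shape (C, H, W).
transform = T.Compose([
    T.Resize((128, 128), antialias=True),
    T.RandomAffine(degrees=10, translate=(0.05, 0.05)),
    T.Normalize(mean=[0.0], std=[1.0]),
])

img = torch.rand(1, 109, 91, device="cuda")  # single-channel 2D slice on the GPU
out = transform(img)                         # result stays on the GPU
```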

For more complex transformations, like elastic deformation, I'm not sure you will find a GPU version. If not, you might have to write one yourself, drop the transformation, or pay the cost of copying back and forth between CPU and GPU during data augmentation.
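
If you do end up writing one yourself, here is a rough sketch of how an elastic-style deformation could be built on torch.nn.functional.grid_sample so it stays on the GPU. The (N, C, D, H, W) layout is an assumption, and the smoothing uses a crude average pool where a proper Gaussian blur would normally go:

```python
import torch
import torch.nn.functional as F

def elastic_deform_3d(volume: torch.Tensor, alpha: float = 5.0, sigma: int = 4) -> torch.Tensor:
    """Rough sketch of an elastic deformation that runs on the volume's device.

    volume: tensor of shape (N, C, D, H, W).
    alpha:  displacement magnitude in voxels.
    sigma:  controls smoothness (half-width of the box blur used here).
    """
    n, c, d, h, w = volume.shape
    device = volume.device

    # Random displacement field, smoothed with average pooling (stand-in for a Gaussian blur).
    disp = torch.rand(n, 3, d, h, w, device=device) * 2 - 1
    k = 2 * sigma + 1
    disp = F.avg_pool3d(disp, kernel_size=k, stride=1, padding=k // 2) * alpha

    # Identity sampling grid in normalized [-1, 1] coordinates, shape (N, D, H, W, 3).
    identity = torch.eye(3, 4, device=device).unsqueeze(0).repeat(n, 1, 1)
    base = F.affine_grid(identity, size=(n, c, d, h, w), align_corners=False)

    # Convert the voxel displacement to normalized coordinates and add it to the grid.
    scale = torch.tensor([2.0 / w, 2.0 / h, 2.0 / d], device=device)
    grid = base + disp.permute(0, 2, 3, 4, 1) * scale

    return F.grid_sample(volume, grid, align_corners=False)

warped = elastic_deform_3d(torch.rand(1, 1, 96, 128, 96, device="cuda"))
```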

Another solution that some people prefer is to precompute a large set of transformations on the CPU as a separate step and save the results to a file. The HDF5 file format is commonly used to store large datasets that can then be read back from disk very quickly. Since you will be saving a finite set of augmentations, be careful to generate several augmentations for each sample of your dataset to preserve some randomness. This is not perfect, but it's a very pragmatic approach that will likely speed things up quite a bit if your CPU is holding your GPU back.
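
A minimal sketch of that precomputation step, assuming h5py is available and using a placeholder augment function in place of your real TorchIO/MONAI pipeline:

```python
import h5py
import numpy as np

def augment(vol: np.ndarray) -> np.ndarray:
    # Placeholder for your existing CPU-side augmentation pipeline.
    noise = np.random.normal(0.0, 0.01, vol.shape)
    return (vol + noise).astype(np.float32)

# Placeholder dataset: a list of 3D volumes as numpy arrays.
volumes = [np.random.rand(91, 109, 91).astype(np.float32) for _ in range(10)]

K = 5  # number of precomputed augmentations per sample
with h5py.File("augmented_dataset.h5", "w") as f:
    for i, vol in enumerate(volumes):
        for k in range(K):
            f.create_dataset(
                f"sample_{i:05d}/aug_{k}",
                data=augment(vol),
                compression="gzip",
            )
```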

Regarding determinism on the GPU, it's true that floating-point operations are not guaranteed to be deterministic by default when run on the GPU. This is because reordering some floating-point operations can make them faster, but the reordering cannot guarantee that the result will be exactly the same (it will be close, of course!). This can matter for reproducibility, for example if you use a seed in your code and still get slightly different results. See the PyTorch documentation on reproducibility to understand the other sources of non-determinism.
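
If exact reproducibility matters more to you than speed, PyTorch lets you opt into deterministic algorithms. A minimal sketch (the environment variable is required by some CUDA ops once determinism is enforced, and it must be set before those ops run):

```python
import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some CUDA ops in deterministic mode
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False             # disable non-deterministic autotuning
torch.use_deterministic_algorithms(True)           # error out where no deterministic kernel exists
```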
