
I have a server which is applying filters (implemented as OpenGL shaders) to images. They are mostly direct colour mappings but also occasionally blurs and other convolutions.
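For a flavour of the workload, the colour mappings are no more involved than the following toy fragment shader (illustrative only; the uniform names and the C++ string wrapper are made up for this post, not taken from my code):

    // Toy "direct colour mapping" shader, held as a C++ string ready to
    // hand to glShaderSource: a simple brightness scale.
    const char* kBrightnessFrag = R"(
        #version 330 core
        uniform sampler2D uImage;     // source texture
        uniform float uBrightness;    // e.g. 1.2 for a 20% boost
        in vec2 vTexCoord;
        out vec4 fragColor;
        void main() {
            vec4 c = texture(uImage, vTexCoord);
            fragColor = vec4(c.rgb * uBrightness, c.a);
        }
    )";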

The source images are PNGs and JPGs in a variety of sizes, ranging from 100×100 pixels up to 16,384×16,384 (the maximum texture size for my GPU).

The pipeline is:

Decode image to RGBA (CPU)
        |
        V
Load texture to GPU
        |
        V
   Apply shader (GPU)
        |
        V
Unload to CPU memory
        |
        V
  Encode to PNG (CPU)

The mean GPU timings are approximately 0.75 ms to load, 1.5 ms to unload, and 1.5 ms to process a texture.

I have multiple CPU threads decoding PNGs and JPGs to provide a continuous stream of work to the GPU.

The challenge is that `watch -n 0.1 nvidia-smi` reports GPU utilisation mostly at 0%–1%, with periodic spikes to 18%.

I really want to get more value out of the GPU, i.e. I'd like to see its load at least around 50%. My questions:

  • Is `nvidia-smi` giving a reasonable representation of how busy the GPU is? Does it, for example, include the time to load and unload textures? If not, is there a better metric I could be using?

  • Assuming that it is, and the GPU really is sitting idle, are there any well-understood architectures for increasing throughput? I've considered tiling multiple images into a large texture, but this feels like it would blow out CPU usage rather than GPU usage.

  • Is there some way I could load the next image into GPU texture memory while the GPU is processing the previous one?

Dave Durbin
  • For the direct color mapping you should not use the GPU. It's surely faster doing it in the CPU. – Cris Luengo Nov 07 '19 at 00:06
  • I may have misled while simplifying. (Most of) the shaders are computing colour mappings on the fly but the computations are very simple i.e. brightness enhancement and do not require convolution or kernels. – Dave Durbin Nov 07 '19 at 00:24
  • @DaveDurbin: "*Is there someway I could be loading the next image to GPU texture memory while the GPU is processing the previous image?*" How are you not doing that already? Are you trying to upload to the image currently being used? I mean, this all seems like a pretty simple case of triple-buffering. – Nicol Bolas Nov 07 '19 at 00:42
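
For reference, a minimal sketch of the buffering idea from the comment above, shown here with two pixel buffer objects (PBOs) and two textures so that the upload of the next image can overlap the shader pass on the current one. This is not the poster's code: `processTexture`, the slot count, and the fixed image size are all assumptions for illustration.

    #include <GL/glew.h>
    #include <cstring>

    // Assumed to exist elsewhere: runs the filter shader over the given
    // texture and reads the result back.
    void processTexture(GLuint texture);

    constexpr int kSlots = 2;
    GLuint pbos[kSlots];
    GLuint textures[kSlots];

    void initSlots(int width, int height) {
        glGenBuffers(kSlots, pbos);
        glGenTextures(kSlots, textures);
        for (int i = 0; i < kSlots; ++i) {
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbos[i]);
            glBufferData(GL_PIXEL_UNPACK_BUFFER,
                         GLsizeiptr(width) * height * 4, nullptr,
                         GL_STREAM_DRAW);
            glBindTexture(GL_TEXTURE_2D, textures[i]);
            glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                         GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
        }
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    }

    // Each call starts the upload of `nextImage` into one slot, then runs
    // the shader on the image uploaded by the previous call.
    void step(int frame, const unsigned char* nextImage,
              int width, int height) {
        int writeSlot = frame % kSlots;
        int readSlot  = (frame + 1) % kSlots;
        size_t bytes  = size_t(width) * height * 4;

        // Copy the decoded pixels into the PBO. glTexSubImage2D returns
        // immediately because its source is driver-owned memory, so the
        // DMA transfer can overlap the shader pass below.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbos[writeSlot]);
        void* dst = glMapBufferRange(
            GL_PIXEL_UNPACK_BUFFER, 0, bytes,
            GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
        std::memcpy(dst, nextImage, bytes);
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        glBindTexture(GL_TEXTURE_2D, textures[writeSlot]);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                        GL_RGBA, GL_UNSIGNED_BYTE, nullptr);  // offset 0 into the bound PBO
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

        // Process the texture filled on the previous call.
        if (frame > 0) processTexture(textures[readSlot]);
    }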

1 Answer


Sampling `nvidia-smi` is a really poor way of figuring out utilization. Use the NVIDIA Visual Profiler (I find this the easiest to work with) or NVIDIA Nsight to get a true picture of your performance and bottlenecks.

It's hard to say how to improve performance without seeing your code and without you having a better understanding of what the bottleneck is.

  • You say you have multiple CPU threads going, but do you have multiple CUDA streams so you can hide the latency of data transfer? Streams let you copy data onto the GPU while it is processing; there's a sketch of the pattern after this list.
  • Are you sure you have threads and not processes? Threads might reduce overhead.
  • Applying a single shader on the GPU takes almost no time, so your pipeline may ultimately be limited by your disk or bus speed. Have you looked up these specs, measured the size of your images, and worked out a theoretical ceiling on your throughput? For example, a 4,096×4,096 RGBA image is about 67 MB; at an effective PCIe 3.0 x16 rate of roughly 12 GB/s that's about 5.6 ms each way, several times your 1.5 ms shader pass. Your GPU is likely to spend a lot of time idle unless you're doing a lot of complicated math on it.
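
To make the streams suggestion concrete, here is a minimal sketch of the overlap pattern, assuming fixed-size images and a toy `filterKernel` standing in for whatever would replace the GLSL shader; none of these names come from the question. Two streams, each with its own pinned staging buffer and device buffer, let the copy for one image run while the kernel for another executes.

    #include <cuda_runtime.h>
    #include <cstring>

    // Toy stand-in for a direct colour mapping: clamped brightness boost.
    __global__ void filterKernel(unsigned char* pixels, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) pixels[idx] = (unsigned char)min(pixels[idx] + 20, 255);
    }

    void processImages(unsigned char** decoded, int count, size_t bytes) {
        cudaStream_t streams[2];
        unsigned char *pinned[2], *devBuf[2];
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMallocHost(&pinned[s], bytes);  // pinned memory: required for async copies
            cudaMalloc(&devBuf[s], bytes);
        }

        for (int i = 0; i < count; ++i) {
            int s = i % 2;
            // Wait for the work queued in this slot two images ago; at this
            // point pinned[s] holds a finished result ready for PNG encoding.
            cudaStreamSynchronize(streams[s]);
            std::memcpy(pinned[s], decoded[i], bytes);
            cudaMemcpyAsync(devBuf[s], pinned[s], bytes,
                            cudaMemcpyHostToDevice, streams[s]);
            int n = (int)bytes;
            filterKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(devBuf[s], n);
            cudaMemcpyAsync(pinned[s], devBuf[s], bytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < 2; ++s) cudaStreamSynchronize(streams[s]);
    }

With only two slots the pipeline alternates copy and compute; a third stream would also let the device-to-host readback overlap both.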
Richard
  • Thanks for the pointer to CUDA streams and NVIDIA Nsight; I'm looking for a CLI-based tool that can integrate with e.g. AWS CloudWatch. Much of the complexity of the data transfer is hidden behind OpenGL calls. I had considered using CUDA rather than OpenGL since the individual shaders are pretty simple, and overlapping data transfer with processing seemed like it might improve throughput, but I wasn't sure that was possible. I'll take a look. – Dave Durbin Nov 07 '19 at 00:21
  • @DaveDurbin: Visual Profiler and Nsight can be used to control remote processes on headless machines. The CLI tool `nvprof` can also be used to generate output which can later be analyzed in NVVP. – Richard Nov 07 '19 at 01:07