
I would like to optimize ML code (SSD in PyTorch) on the NVIDIA Jetson Xavier NX (Developer Kit). One of the bottlenecks seems to be indexing PyTorch (1.6.0) tensors with a boolean mask ("list slicing") on the GPU device.

The same problem occurred on an NVIDIA GeForce GTX 1050 Ti (GP107), where the CPU was ~2 times faster.

Let me create the variables first

import torch
from time import time

cuda0 = torch.device('cuda:0')

probs = torch.ones([3000], dtype=torch.float64, device=cuda0)
mask = torch.ones([3000], dtype=torch.bool, device=cuda0)

probs_cpu = probs.cpu()
mask_cpu = mask.cpu()

Then run the logic (approximately the same results occurred on every run):

before = time()
probs[mask]
print(f'GPU {time() - before:.5f}') # output: GPU 0.00263


before = time()
probs_cpu[mask_cpu]
print(f'CPU {time() - before:.5f}') # output: CPU 0.00066
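Note that CUDA kernels are launched asynchronously, so the GPU wall-clock figure above may also include launch/queueing overhead. Continuing from the setup above, a sketch of the same measurement with explicit synchronization before each clock read (the numbers may differ):

torch.cuda.synchronize()   # make sure no earlier GPU work is still queued
before = time()
probs[mask]                # boolean-mask indexing on the GPU tensor
torch.cuda.synchronize()   # wait for the kernel to finish before stopping the clock
print(f'GPU (synchronized) {time() - before:.5f}')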

Why is this boolean-mask indexing ("list slicing") ~4 times slower on the GPU than on the CPU with PyTorch library version 1.6.0 on the NVIDIA Jetson Xavier NX Developer Kit, according to the code above? How can I speed it up?

Code details: see line 51 in predictor.py, which is part of an SSD implementation in PyTorch.

Run it on the CPU? The whole algorithm will not be faster if I run it on the CPU, since downloading the data from the GPU takes too long (~0.00805 s).
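For completeness, that transfer cost could be measured in isolation like this (a sketch; .cpu() blocks until the device-to-host copy completes, and the initial synchronize keeps any queued kernels out of the measurement):

torch.cuda.synchronize()   # flush any GPU work queued before the copy
before = time()
result_cpu = probs.cpu()   # device-to-host copy of the 3000-element tensor
print(f'GPU -> CPU transfer {time() - before:.5f}')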

  • I can reproduce this on a desktop. My guess is that sending an instruction to the GPU takes time. – hkchengrex Sep 17 '20 at 12:32
  • I cannot directly reproduce this on google colab with a Tesla P4: `GPU 0.00870 CPU 0.01367`. But results vary a lot. So @hkchengrex might have a point here. When you take larger tensors however, like `3000000`, things change very much in favour of the GPU: `GPU 0.00115 CPU 0.03155`. See also https://stackoverflow.com/questions/53325418/pytorch-speed-comparison-gpu-slower-than-cpu/53327162#53327162 – MBT Sep 17 '20 at 20:11
  • Please check [here](https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964/2) for time measurement on CUDA and related discussion. Using events you will get only the time spent by the GPU on this operation (an event-based sketch follows these comments). – Szymon Maszke Sep 17 '20 at 22:28
  • The operation produces a vector of unknown size – that's probably the reason why it takes so long on the GPU. – Petr Dvořáček Oct 05 '20 at 16:58
  • I have the same issue: boolean masking on the GPU is 200 times slower than on the CPU, and I also think it's because an arbitrarily sized result vector is hard to handle in CUDA. – SHIM Sep 06 '22 at 08:47
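Following the comment about CUDA events, the event-based measurement might look like this (a sketch using torch.cuda.Event; elapsed_time reports milliseconds and covers only the time the GPU spends on the operation):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()   # start from an idle GPU queue
start.record()
probs[mask]                # boolean-mask indexing on the GPU
end.record()
torch.cuda.synchronize()   # wait for 'end' to be recorded before reading it
print(f'GPU event time {start.elapsed_time(end):.5f} ms')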

0 Answers