
I didn't think we could print anything from the GPU, since calling `print` inside a `@cuda.jit` function doesn't work, but then I tried calling `A_gpu.shape` to see what would happen.

```python
>>> import numpy as np
>>> from numba import cuda

>>> A = np.random.randn(1000, 1000)
>>> A_gpu = cuda.to_device(A)
>>> A_gpu.shape
(1000, 1000)
>>> A_gpu[0][0]
0.4253498653987585
>>> A_gpu.T
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f5de810ffa0>
```

For something to be printed to the console, do the numbers need to be copied to the CPU first?

```python
>>> %timeit A.T
132 ns ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit A_gpu.T
159 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit A.shape
76 ns ± 2.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit A_gpu.shape
47.8 ns ± 8.81 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit A[0][0]
376 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit A_gpu[0][0]
161 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

Calling `A_gpu.shape` is faster than `A.shape` for some reason, but the other operations are slower on the GPU. However, it might be the case that accessing elements `A[i, j]` inside a `@cuda.jit` kernel is optimized and isn't slower.

I am implementing a CUDA kernel for matrix multiplication, with the intention of using it for backpropagation in neural networks, meaning `dL_dX = np.dot(dL_dY, self.weights.T)` will be performed very often.

If I need to transpose a matrix, is it bad practice to transpose it on the GPU, as in `matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_gpu.T)`? Would it be better to transpose the matrix on the CPU first and then move/"cache" the result to the GPU with `cuda.to_device(A.T)`? Interestingly, moving the transposed array to the GPU (`%timeit cuda.to_device(A.T)` gives 2.41 ms ± 145 µs) is much faster than transposing the array on the GPU.

BPDev

1 Answer


Numba's GPU array transpose runs a GPU kernel. That is why it is slow compared to NumPy, which will generally just change the strides and not touch the underlying data.
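To illustrate the NumPy side of that comparison (a minimal sketch): `A.T` is a view over the same buffer with the shape and strides swapped, so no data moves at all.

```python
import numpy as np

A = np.random.randn(1000, 500)
At = A.T

# The transpose is a view: same underlying buffer, no copy.
print(At.base is A)            # the view's base is the original array
print(A.shape, At.shape)       # (1000, 500) vs (500, 1000)
print(A.strides, At.strides)   # strides are simply swapped

# Writing through the view is visible in the original,
# confirming they share memory.
At[0, 0] = 42.0
print(A[0, 0])
```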

The canonical way to perform a dot product involving a transposed matrix or matrices (dating back to the origins of Linpack and BLAS) is to change the algorithm to handle reading the input in a transposed order rather than actually transposing the input data prior to performing the product operation.
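As a minimal illustration of that idea (plain Python loops standing in for the per-thread body of a `@cuda.jit` kernel; the function name `matmul_transposed_b` is hypothetical), `C = A @ B.T` can be computed by reading `B` in transposed order, without ever materializing `B.T`:

```python
import numpy as np

def matmul_transposed_b(A, B, C):
    """Compute C = A @ B.T by reading B in transposed order.

    Each (i, j) iteration mirrors what one CUDA thread would do:
    accumulate sum over p of A[i, p] * B[j, p]. The swapped indices
    on B are the whole trick -- B is never transposed in memory.
    """
    n, k = A.shape
    m = B.shape[0]  # B is (m, k), so B.T is (k, m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[j, p]
            C[i, j] = acc

A = np.random.randn(4, 3)
B = np.random.randn(5, 3)
C = np.empty((4, 5))
matmul_transposed_b(A, B, C)
print(np.allclose(C, A @ B.T))
```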

talonmies
  • Would there be a reason that taking the NumPy approach and just changing the strides doesn't work on the GPU? I am guessing that changing the [strides](https://i.stack.imgur.com/oQQVI.png) would make it harder to optimize caching, since the values in a column are physically far from each other. However, with `A` and `B` from `np.random.rand(100, 100)`, `%timeit A @ B` and `%timeit A @ B.T` yield 260 µs ± 36.3 µs and 231 µs ± 11.4 µs, so numpy probably has a different algorithm for transposes. – BPDev Feb 22 '23 at 18:20
  • Update: I wanted to make sure that `A_T_gpu = cuda.to_device(A.T)` even works, and `A_T_gpu` does have a `strides` property that behaves as expected. Therefore, we could just change the strides on the GPU if we wanted. – BPDev Feb 22 '23 at 18:45