I'm writing an optimised Floyd-Warshall algorithm on the GPU using Numba. I need it to finish in a few seconds for 10k × 10k matrices. Right now the processing takes around 60 s. Here is my code:
import math
from numba import cuda

def calcualte_distance_to_all_gpu(matrix):
    # one thread per matrix cell
    threadsperblock = (32, 32)
    blockspergrid_x = math.ceil(matrix.shape[0] / threadsperblock[0])
    blockspergrid_y = math.ceil(matrix.shape[1] / threadsperblock[1])
    blockspergrid = (blockspergrid_x, blockspergrid_y)
    calculate_distance_to_all_cuda[blockspergrid, threadsperblock](matrix)

@cuda.jit
def calculate_distance_to_all_cuda(matrix):
    i, j = cuda.grid(2)
    N = matrix.shape[0]
    # plain range here: prange has no meaning inside a CUDA kernel
    for k in range(N):
        if i < matrix.shape[0] and j < matrix.shape[1]:
            if matrix[i, k] + matrix[k, j] < matrix[i, j]:
                matrix[i, j] = matrix[i, k] + matrix[k, j]
To be honest I'm pretty new to writing GPU code, so do you have any ideas how to make this even faster? I also noticed that while it runs, my GPU utilisation only briefly peaks at 100% and then the device sits idle, so maybe the problem is in sending data from the CPU to the GPU? If so, is there any way to optimise that transfer? Or should I use a different algorithm for this task?
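For context, here is roughly the restructuring I was considering to avoid implicit transfers: copy the matrix to the device once with cuda.to_device, launch the kernel, and copy the result back once at the end. I also suspect the k-loop may need to run on the host, with one kernel launch per k, so that every thread sees the updated values before the next iteration. The names fw_step and calculate_distance_to_all_gpu below are just my renaming for this sketch; I'm not sure it's the right approach, so please treat it as an idea rather than working code:

import math
import numpy as np
from numba import cuda

@cuda.jit
def fw_step(matrix, k):
    # relax all (i, j) pairs through the intermediate vertex k
    i, j = cuda.grid(2)
    n = matrix.shape[0]
    if i < n and j < n:
        d = matrix[i, k] + matrix[k, j]
        if d < matrix[i, j]:
            matrix[i, j] = d

def calculate_distance_to_all_gpu(matrix):
    n = matrix.shape[0]
    threadsperblock = (32, 32)
    blockspergrid = (math.ceil(n / threadsperblock[0]),
                     math.ceil(n / threadsperblock[1]))
    # copy to the device once, instead of letting Numba transfer on every launch
    d_matrix = cuda.to_device(matrix)
    for k in range(n):
        # one launch per k acts as a grid-wide synchronisation point
        fw_step[blockspergrid, threadsperblock](d_matrix, k)
    # single copy back to the host at the end
    return d_matrix.copy_to_host()

Is moving the k-loop to the host like this the usual way to do Floyd-Warshall on a GPU, or does the launch overhead for 10k kernel calls kill the benefit?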