As a first attempt at executing routines on a graphics card, I have implemented a small function containing a large loop (1 billion iterations). Although it is not yet parallelized, the script runs quite fast on CUDA:
%%time
from numba import jit, cuda
import numpy as np
from math import sqrt

@cuda.jit
def find_integer_solutions_cuda(arr):
    # Note: every launched thread currently runs this full loop itself;
    # the work is not yet distributed across the grid.
    i = 0
    for x in range(0, 1000000000 + 1):
        y = float(x**6 - 4*x**2 + 4)
        sqr = int(sqrt(y))
        if sqr*sqr == int(y):
            arr[i][0] = x
            arr[i][1] = sqr
            arr[i][2] = y
            i += 1

arr = np.zeros((10, 3))
find_integer_solutions_cuda[128, 255](arr)
print(arr)
This script works fine and finishes within 5 minutes using the thread configuration [128, 255]
(other configurations slow it down) on a machine with 128 GB RAM, an Intel Xeon E5-2630 v4 CPU at 2.20 GHz, and two Tesla V100 graphics cards with 16 GB RAM each. It yields:
[[0.00000000e+00 2.00000000e+00 4.00000000e+00]
[1.00000000e+00 1.00000000e+00 1.00000000e+00]
[7.08337220e+07 2.64700090e+09 7.00661374e+18]
[6.56031067e+08 2.29447517e+09 5.26461630e+18]
[0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00]]
CPU times: user 58.5 s, sys: 4min 5s, total: 5min 4s
Wall time: 5min 4s
Background:
I am experimenting with the runtime behavior and performance of large loops that do (fairly simple) arithmetic/mathematical work. The equation in the code snippet above is just an example. What I observed is that this piece of code performs best with numba's @jit decorator (5 seconds); a rough sketch of that baseline follows the comparison below. Using gmpy2, the same task finishes within 15 minutes, and with no optimization at all (plain numpy/Python) the same routine takes almost 2 hours. I'm curious how it looks when I parallelize the routine on a GPU-powered machine via TensorFlow. To put it briefly, for this routine running up to 1 billion we have:
5 sec (numba @jit) < 5 min (numba @cuda.jit) < 15 min (gmpy2) < 2 hrs (plain numpy)
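For reference, the 5-second CPU baseline is essentially the same loop compiled with numba's @jit. This is a minimal sketch of that variant (the exact code I timed may differ slightly):

from numba import jit
import numpy as np
from math import sqrt

@jit(nopython=True)
def find_integer_solutions_jit(arr):
    # Same search as the CUDA kernel, but compiled for the CPU.
    i = 0
    for x in range(0, 1000000000 + 1):
        y = float(x**6 - 4*x**2 + 4)
        sqr = int(sqrt(y))
        if sqr*sqr == int(y):
            arr[i][0] = x
            arr[i][1] = sqr
            arr[i][2] = y
            i += 1

arr = np.zeros((10, 3))
find_integer_solutions_jit(arr)
print(arr)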
My question:
I would like to extend this experiment and incorporate real parallelism by transferring this small script to TensorFlow, if it is able to "fill a tensor with jobs" to be parallelized. I am thinking in the direction of filling tensors with numbers, letting the graphics card take these numbers, insert them into the equation y = float(x**6-4*x**2+4)
in parallel, do the checking, and fill the result array.
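To illustrate what I have in mind, here is a rough sketch of how such a batched, vectorized evaluation might look in TensorFlow. The batch size, the use of float64, and the function name evaluate_batch are my own assumptions, not tested code; note also that float64 loses exactness long before x = 1e9 (x**6 exceeds the 53-bit mantissa), so this only illustrates the parallel structure, not a numerically exact search:

import tensorflow as tf
import numpy as np

@tf.function
def evaluate_batch(x):
    # Evaluate y = x**6 - 4*x**2 + 4 for a whole batch of x values at once
    # and flag those where y is (within float64 precision) a perfect square.
    x = tf.cast(x, tf.float64)
    y = x**6 - 4.0*x**2 + 4.0
    sqr = tf.floor(tf.sqrt(y))
    is_square = tf.equal(sqr * sqr, y)
    return x, sqr, y, is_square

batch_size = 10_000_000  # assumed batch size, tune to GPU memory
results = []
for start in range(0, 1_000_000_001, batch_size):
    xs = tf.range(start, min(start + batch_size, 1_000_000_001), dtype=tf.int64)
    x, sqr, y, is_square = evaluate_batch(xs)
    # Keep only the rows where the check succeeded.
    hits = tf.boolean_mask(tf.stack([x, sqr, y], axis=1), is_square)
    if tf.shape(hits)[0] > 0:
        results.append(hits.numpy())

print(np.vstack(results) if results else "no solutions found")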