
As a first attempt at executing routines on a graphics card, I have implemented a small function containing a large loop (1 billion steps). Although it is not yet parallelized, the script runs quite fast on CUDA:

%%time
from numba import jit, cuda
import numpy as np
from math import sqrt

@cuda.jit
def find_integer_solutions_cuda(arr):
    i=0
    for x in range(0, 1000000000+1):
        y = float(x**6-4*x**2+4)
        sqr = int(sqrt(y))
        if sqr*sqr == int(y):
            arr[i][0]=x
            arr[i][1]=sqr
            arr[i][2]=y
            i+=1

arr=np.zeros((10,3))
find_integer_solutions_cuda[128, 255](arr)

print(arr)

This script works fine and finishes within 5 minutes using the thread configuration [128, 255] (other configurations slow it down) on a machine with 128 GB RAM, an Intel Xeon E5-2630 v4 (2.20 GHz) CPU and two Tesla V100 graphics cards with 16 GB RAM each. It yields:

[[0.00000000e+00 2.00000000e+00 4.00000000e+00]
 [1.00000000e+00 1.00000000e+00 1.00000000e+00]
 [7.08337220e+07 2.64700090e+09 7.00661374e+18]
 [6.56031067e+08 2.29447517e+09 5.26461630e+18]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]]
CPU times: user 58.5 s, sys: 4min 5s, total: 5min 4s
Wall time: 5min 4s

Background: I am experimenting with the runtime behavior and performance of large loops doing (fairly simple) arithmetic/mathematical work. The equation in the code snippet above is just an example. What I observed is that this piece of code performs best with numba's @jit decorator (5 seconds). Using gmpy2, the same task finishes within 15 minutes. With no optimization (just plain numpy) the same routine takes almost 2 hours. I'm curious how it looks when I parallelize the routine on a GPU-powered machine via TensorFlow. To put it briefly, for this routine running up to 1 billion we have:

5 sec (numba @jit) < 5 min (numba @cuda.jit) < 15 min (gmpy2) < 2 hrs (plain numpy)
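
For reference, the numba CPU variant is essentially the same routine with the decorator swapped (a rough sketch; the exact harness of my runs may differ):

from numba import jit
import numpy as np
from math import sqrt

@jit  # numba's CPU JIT instead of cuda.jit
def find_integer_solutions_jit(arr):
    i = 0
    for x in range(0, 1000000000+1):
        y = float(x**6-4*x**2+4)   # same overflow/precision caveats as the CUDA version
        sqr = int(sqrt(y))
        if sqr*sqr == int(y):
            arr[i][0] = x
            arr[i][1] = sqr
            arr[i][2] = y
            i += 1

arr = np.zeros((10, 3))
find_integer_solutions_jit(arr)
print(arr)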

My question: I would like to extend this experiment and incorporate real parallelism by transferring this small script to TensorFlow, if it is able to "fill a tensor with jobs" to be parallelized. I'm thinking in the direction of filling a tensor with numbers, where the graphics card then takes these numbers, inserts them into the equation y = float(x**6-4*x**2+4) in parallel, does the checking, and fills the result array.
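
Roughly, I picture something like the following untested sketch (the range is cut to x <= 1000 here so that x**6 still fits into int64):

import tensorflow as tf

xs = tf.range(0, 1001, dtype=tf.int64)                           # fill a tensor with candidate x values
ys = xs**6 - 4*xs**2 + 4                                         # evaluate the polynomial element-wise (on the GPU if one is present)
rs = tf.cast(tf.sqrt(tf.cast(ys, tf.float64)) + 0.5, tf.int64)   # rounded integer square roots
hits = tf.boolean_mask(xs, rs * rs == ys)                        # keep x where y is a perfect square
print(hits.numpy())                                              # expect only x = 0 and x = 1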

Eldar Sultanow
  • Your numba code is totally bogus. You do understand that you are running that calculation completely serially? You literally have 32640 GPU threads doing the exact same thing. And the code is only doing that serial calculation on one of your two GPUs. – talonmies Jan 18 '22 at 13:42
  • Yes, I do not have parallelization here; you are totally right. However, it is still faster than running it without CUDA (doing the same routine classically takes hours). – Eldar Sultanow Jan 18 '22 at 15:00
  • It makes no sense to "parallelise" CUDA code on a machine with one GPU. CUDA code should already be parallel. Running many sequential CUDA codes in parallel is a really bad idea (because the GPU cannot benefit from using SIMD instructions, which is critical for performance). If this CUDA code is faster than the tested CPU code, it means the CPU code used is really inefficient. Using 1 CUDA core is always slower than a CPU core (except embedded ones). What makes GPUs fast is the large number of cores available (compared to CPUs, assuming they can be compared). – Jérôme Richard Jan 18 '22 at 15:22
  • Honestly, I do not think you need a GPU for such a task. A parallel CPU version should be quite fast (~1 sec). The data transfer from/to the GPU will likely be slower than the computation, and the computation should be mostly memory bound. Not to mention that doing the filtering efficiently on the GPU is not so simple (especially for new CUDA users). Moreover, it also adds a dependency on Nvidia GPUs. – Jérôme Richard Jan 18 '22 at 16:46
  • OK - thank you for this useful hint. I missed the overhead occurring due to data transfer to/from the GPU. Good point! – Eldar Sultanow Jan 18 '22 at 17:19
  • I am surprised about the speed of Numpy. An optimized Numpy code should be pretty fast. The problem likely comes from the huge temporary arrays created if no chunks are used (not having enough RAM, for example, can slow down the computation a lot). Note that int is 32-bit on some machines (e.g. most Windows platforms), which makes your code bogus because of overflow. *Correctness matters more than performance* (producing a very fast wrong program is easy). Note that `gmpy2` uses multi-precision to avoid such problems (which also slows down the execution). – Jérôme Richard Jan 18 '22 at 17:44
  • Absolutely correct: it is a balance between precision and speed. Here the best compromise is indeed `gmpy2`. That is one finding of this investigation. Sometimes speed may be the better choice over precision, when you can sort out these "false positives" efficiently afterwards. I must admit that I am still a bit curious about a TensorFlow-based parallelization (possibly it will range between JIT and CUDA.JIT). – Eldar Sultanow Jan 18 '22 at 18:10
  • The thing is, if you work with native integers like Numba does, the result will simply be wrong due to the rounding and the overflow. The compiler is free to make crazy assumptions because the integer overflow causes *undefined behavior*. It can, for example, partially remove the loop. In fact, the JIT used by Numba does such things. The loop is only valid for values < ~1500. Above that, the result is neither an under-approximation nor an over-approximation. Moreover, note that GPUs are pretty bad at dealing with large integers/floats, and Numba does not natively support them either. – Jérôme Richard Jan 18 '22 at 20:38
  • Note that [WolframAlpha](https://www.wolframalpha.com/input/?i=solve+x**6+-+4*x**2+%2B+4+%3D+n**2+for+n+over+the+integer) solves your problem in a few seconds in an accurate way (apparently over all possible integers). Using advanced algebra is much better than brute-forcing all the possible values on a CPU/GPU. – Jérôme Richard Jan 18 '22 at 21:11
  • You are right - mathematically it is clear that no other solutions exist. It is just an example of a large loop and integer overflows. Obviously better examples might exist. What interests me here is the Python approach to parallelization. – Eldar Sultanow Jan 18 '22 at 21:29
  • I'm curious how long the code from [my answer](https://stackoverflow.com/a/70820501/941531) takes on your GPU. Also, maybe set `N` to 10 billion or 100 billion in my code if it finishes too fast on your GPU. 1 billion on my 2-core 1.2 GHz laptop takes just 89 seconds. – Arty Jan 23 '22 at 09:24
  • I will test it on the Tesla server right now and will let you know. It's running :-) – Eldar Sultanow Jan 23 '22 at 10:04
  • 10^9 takes too long: if 10^6 takes 13 seconds, then 10^9 will take more than an hour. Test 10^7, it is optimal, 10^8 at most. The 13 seconds also includes a lot of initialization work, so 10^6 is too small; use 10^7. – Arty Jan 23 '22 at 10:28
  • Good hint. On the machine (128 GB RAM, Xeon CPU, 2 Tesla cards), I restarted the jobs and obtained the following results: 9.324 sec for `10**6`, 94.630 sec for `10**7`, 909.235 sec for `10**8`. For the sake of completeness I also started a job with `10**9` (let's see how long it takes). – Eldar Sultanow Jan 23 '22 at 10:58
  • Finally the `10**9` case finished within 9187.678 seconds :-) – Eldar Sultanow Jan 23 '22 at 13:48
  • @EldarSultanow If you meant your code not to test GPU speed but, for example, to do the real task of computing the `order` of an elliptic curve (the number of points it has), then of course you should use something like a sieve as in [my other answer](https://stackoverflow.com/a/70810780/941531). There also exist fast algorithms for computing elliptic curve order, although they are complex; I always wanted to implement them but never found a good description of their algorithm on Wikipedia. Also, the sieving I used in my other answer is a very GPU-friendly task: it uses small integers and is fast. – Arty Jan 23 '22 at 16:34
  • Yes, you are right: sieving is perfect for elliptic curves. I am interested in (CUDA, JIT, ...) tuning in general. For instance, my next task is to check perfect squares in the context of finding 6-tuples of squares `[s,t,u,t+u,t+u−s,t−s]`, see [this post](https://math.stackexchange.com/questions/4342283/for-integers-xyz-why-are-these-cases-impossible-for-mengolis-six-square-pr). I could generate those tuples up to `s=13572` (see [this text file](https://github.com/Sultanow/pythagorean/blob/main/pythagorean_stu.txt) on GitHub). Then it gets totally stuck. I will put this into a separate question. – Eldar Sultanow Jan 23 '22 at 16:44
  • @EldarSultanow I didn't reach a final understanding from your Math post: do the `w < x < y < z` that you search for possibly exist, or do they not exist for sure, according to the two answers to your Math post? In other words, do I need to think about this interesting task of searching for `w < x < y < z` in my spare time? Also, maybe you can make it more programming-oriented by creating a similar question on the programming site Stack Overflow? That would be great, an interesting task! Searching for triples `x < y < z` is also an interesting challenge, especially because you have already found some solutions, so they exist for sure! – Arty Jan 23 '22 at 17:00
  • @EldarSultanow I accidentally looked at [your code](https://github.com/Sultanow/pythagorean/blob/68f2937ca890ba2080658d2ff945239bdb6459bf/pythagorean_search_gpu.py#L14) and may suggest that all Numpy-based Numba functions will benefit a lot from `@njit` instead of `@jit`, if they compile. With `@njit`, Numba compiles to very efficient [LLVM](https://llvm.org/)-based code that is optimized by LLVM and converted to native machine code. Both njit/jit take the two params `cache=True, parallel=True`; the first avoids recompiling on each run, the second runs Numpy work on all CPU cores. – Arty Jan 23 '22 at 17:05
  • @EldarSultanow If you use `@njit` as described in the previous comment, then all Python-related stuff like complex `print()` calls or using `gmpy2` needs to be wrapped into blocks like `with numba.objmode(): print(str(some_pure_python_object))`, otherwise the `@njit` function will not compile, as njit doesn't allow any interaction with Python: it is pure native code, same as compiled C++ code. All interactions with Python should be done within `numba.objmode()` blocks as shown above. – Arty Jan 23 '22 at 17:08
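
For reference, a minimal, illustrative sketch of the `@njit` + `numba.objmode` pattern described in the comment above (the function name and values are made up):

from numba import njit, objmode, prange

@njit(cache=True, parallel=True)
def sum_of_squares(n):
    total = 0
    for i in prange(n):          # distributed over all CPU cores
        total += i * i
    with objmode():              # temporarily drop back into the Python interpreter
        print('total computed inside njit:', total)
    return total

print(sum_of_squares(10))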

1 Answer


If you just want to measure the speed of your GPU, rather than solve your task for x up to 1 billion, then I can suggest the following.

Since your polynomial has order x ** 6, its value stays below 10^18 only for x below 1000. As we know, roughly 18 * 10^18 is the maximal uint64 value.

But we're doing the sqrt operation on double-precision floating-point values, which have at most 53 bits of precision (the mantissa is 53 bits). Also taking the rounding errors of sqrt into account, we only fit into a double for x below roughly 200-400.

If you overflow double precision, then the sqrt result is not exact and you can't do the comparison sqr * sqr == int(y) anymore, because the error is too large. You simply will not find the squares and will not solve your task (unless you just wanted to check the speed of the GPU).
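
As a tiny standalone illustration of this 53-bit limit (not part of the solution below):

import math

# integers above 2**53 are no longer exactly representable in float64
print(float(2**53) == float(2**53 + 1))    # True: the +1 is silently lost

# consequence: a genuine perfect square whose root exceeds 2**53 can be missed
n = 2**53 + 1
y = n * n
r = int(math.sqrt(float(y)) + 0.5)
print(r == n, r * r == y)                  # the float-based check misses this square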

Unfortunately, TensorFlow doesn't have arbitrary-precision integer or floating-point arithmetic; it offers at most float64/int64. Otherwise we could use X as big as we want. With limited precision we can only check small X values.

To solve the task precisely, I provide TensorFlow code below with x limited to just 200. But to have the same amount of computation as in your code snippet, I simply repeat the small values of X many times until I have 1 billion values.

You can tweak the parameter N = 10 ** 6 (the total number of values) to any value you want, but it should be a multiple of block = 200, and block shouldn't be larger than 200; you can tweak both params.

You may also set block = N if you just want to test the speed of the GPU rather than solve the task precisely. In that case there will be a large error in computing the polynomial and the square root, but with this error you can still check the GPU speed. You'll just get wrong final results for large X.

At the end of my code I output all found squares as a list of tuples (x, y, y^2). The total running time is also printed.

Before running the code, don't forget to install TensorFlow through python -m pip install tensorflow.

On my 1.2 GHz 2-core laptop it takes 89 seconds to process 1 billion values with the following TensorFlow code.

import time, tensorflow as tf

N = 10 ** 6
block = 200
nblocks = N // block

tb = time.time()

x = tf.repeat(tf.range(0, block, dtype = tf.int64)[None, :], nblocks, axis = 0)  # nblocks rows, each holding x = 0..block-1
f = x ** 6 - 4 * x ** 2 + 4                                                       # polynomial evaluated element-wise in int64
r = tf.cast(tf.math.sqrt(tf.cast(f, tf.float64)) + 0.5, tf.int64)                 # rounded integer square root
is_square = tf.where(r * r == f)                                                  # (row, col) indices where f is a perfect square

vals = [tuple(map(int, [i, j, x[i, j], r[i, j], f[i, j]])) for i, j in is_square]
print('(x, y, y^2):\n', [(x0, y0, y0_sqr) for i, j, x0, y0, y0_sqr in vals if i == 0])
print(f'Time {time.time() - tb:.03f} sec')

Output:

(x, y, y^2):
 [(0, 2, 4), (1, 1, 1)]
Time 88.912 sec
Arty
  • I am surprised by the timing results, as this is quite slow. Numba manages to do the same thing (on a CPU) in ~4 seconds on my machine (with optimized code). I guess this is because TensorFlow generates (slow/huge) temporary arrays. It is also worth noting that good desktop GPUs should compute this task more slowly than most mid-range desktop CPUs because they do not support 64-bit operations efficiently (these are much slower than 32-bit ones). In fact, an expensive RTX 3070 [can only achieve](https://en.wikipedia.org/wiki/GeForce_30_series) what my quite cheap 2-year-old i5-9600KF processor can. – Jérôme Richard Jan 23 '22 at 15:49
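
For context, an optimized multi-core CPU version of the kind this comment alludes to might look roughly like the sketch below (this is not the commenter's actual code, and the int64/float64 precision caveats discussed in the comments still apply):

import numpy as np
from numba import njit, prange

@njit(cache=True, parallel=True)
def count_perfect_squares(n):
    hits = 0
    for x in prange(n):                               # distributed over all CPU cores
        y = float(x) ** 6 - 4.0 * float(x) ** 2 + 4.0
        r = np.floor(np.sqrt(y))
        if r * r == y:                                # float comparison: only reliable for small x
            hits += 1
    return hits

print(count_perfect_squares(10**6))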