
WHAT I WANT TO DO:

I have a script that I use to find the prime numbers within a certain range:

# Python program to display all the prime numbers within an interval

lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
    # all prime numbers are greater than 1
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            print(num)

I would like to run this script on the GPU instead of the CPU so that it runs faster.

THE PROBLEM:

I don't have an NVIDIA GPU on my Intel NUC NUC8i7HVK, only an AMD "discrete GPU" alongside the Intel integrated graphics.


If I run this code to check which GPUs I have:

import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a,
    __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)

cl.enqueue_copy(queue, res, dest_buf)

print(a, res)

I receive:

  • [0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ffb3d492fd0>
  • [1] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x187b648ed80>

THE POSSIBLE APPROACH TO THE PROBLEM:

I found a guide that takes you by the hand and explains step by step how to run your code on the GPU. But all the Python libraries that pipe Python through the GPU, like PyOpenGL, PyOpenCL, TensorFlow (Force python script on GPU), PyTorch, etc., are tailored for NVIDIA.

If you have an AMD GPU, all of these libraries ask for ROCm, but as far as I know that software still doesn't support integrated GPUs or a discrete GPU like mine (see my own answer below).

I only found one guide that talks about this approach, but I cannot make it work.

Is there hope, or am I just trying to do something impossible?

EDIT: Reply to @chapelo

If I choose 0 the reply is:

Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

If I choose 1 the reply is:

Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]
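
To avoid being prompted every run, the platform index can also be set from inside the script before the context is created. A minimal sketch (assuming the indices from the listing above, where 0 is the AMD platform):

import os
import pyopencl as cl

# Setting PYOPENCL_CTX before create_some_context() answers the
# interactive prompt; "0" selects the AMD platform listed above.
os.environ["PYOPENCL_CTX"] = "0"

ctx = cl.create_some_context()
print(ctx.devices)  # confirm which device was actually picked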
Francesco Mantovani
  • Did you look at the [documentation of numba for AMD Gpu using `RocM`](https://numba.readthedocs.io/en/stable/roc/index.html)? – Lescurel Jan 13 '21 at 15:01
  • @Lescurel Thank you, it looks like a good starting point. Apparently it is only for Linux but I can sort it out. I will have a look at it. – Francesco Mantovani Jan 13 '21 at 17:02
  • @Lescurel, I'm following this guide right now https://shawonashraf.github.io/rocm-tf-ubuntu/ . BTW, I'm doing all this on an Intel NUC8i7HVK https://ark.intel.com/content/www/us/en/ark/products/126143/intel-nuc-kit-nuc8i7hvk.html which doesn't really have a GPU but instead has two "integrated" GPUs. Am I wasting my time, or can Tensorflow, PyOpenGL, etc... also work with integrated GPUs? – Francesco Mantovani Jan 13 '21 at 23:22
  • I don't know, I never tried. Good luck! – Lescurel Jan 14 '21 at 07:29
  • 1
    I had the same problem and basically I had to go to Nvidia. – John Stud Jan 20 '21 at 03:37
  • When you received that output "[0] – chapelo Jan 20 '21 at 19:11
  • @chapelo, thank you for your interest. Nice question, I posted the reply. I cannot tell what that means – Francesco Mantovani Jan 20 '21 at 20:57
  • It's telling you that it found 2 possible contexts 0 is your AMD and 1 is your Intel. Which do you prefer? Enter the value 0 or 1 and see what happens. To avoid being asked you have to set the environment variable PYOPENCL_CTX="0" if you want to use your AMD. – chapelo Jan 20 '21 at 21:00
  • Thank you @chapelo, I selected `0`. Now how can I tell to python to use the GPU when I run that script? – Francesco Mantovani Jan 20 '21 at 21:07
  • Or you could define the context you want in your program, not by using `cl.create_some_context()` but rather specifying the context yourself, something like `ctx = cl.Context(dev_type=cl.device_type.ALL, properties=[(cl.context_properties.PLATFORM, plat[0])])` – chapelo Jan 20 '21 at 21:08
  • It means it is working. Set your environment variable I think using SET or editing the Registry – chapelo Jan 20 '21 at 21:16

3 Answers


After extensive research and several tries I reached this conclusion:

  1. PyOpenGL: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm
  2. PyOpenCL: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm
  3. TensorFlow: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm
  4. PyTorch: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm

I installed ROCm but if I run rocminfo it returns:

ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

clinfo returns:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

rocm-smi returns:

Segmentation fault

This is because the official guide says that "The integrated GPUs of Ryzen are not officially supported targets for ROCm.", and since mine is an integrated GPU I'm out of scope.

I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU).

Francesco Mantovani
  • Do you have the ICD's for your GPUs? – chapelo Jan 20 '21 at 19:09
  • @chapelo I have the file `C:\Windows\System32\OpenCL.dll` and `/etc/OpenCL/vendors/amdocl64.icd` in Linux – Francesco Mantovani Jan 20 '21 at 21:09
  • @FrancescoMantovani I believe you did extensive research on that. However, I'm not sure this should be considered an answer, especially when you said "I will stop wasting my time and probably buy an NVIDIA or AMD eGPU (external GPU)". I suggest that you edit your post instead. Thanks! – ndrwnaguib Jan 22 '21 at 09:37
  • 1
    @AndrewNaguib, yeah, definitely, I'm waiting chapelo 's answer which I believe is a good solution. I will test it and I will probably remove my answer. – Francesco Mantovani Jan 22 '21 at 12:42

pyopencl does work with both your AMD and your Intel GPUs, and you checked that your installation is working. Just set the environment variable PYOPENCL_CTX='0' to use the AMD every time without being asked.

Or, instead of using ctx = cl.create_some_context(), you could define the context in your program yourself:

platforms = cl.get_platforms()
ctx = cl.Context(
    dev_type=cl.device_type.ALL,
    properties=[(cl.context_properties.PLATFORM, platforms[0])])
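
To double-check which platform index maps to which GPU before hard-coding platforms[0] or platforms[1], a quick listing sketch (plain pyopencl, nothing machine-specific is assumed):

import pyopencl as cl

# Print every platform/device pair so you can see which index is the AMD
# and which one is the Intel.
for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print("platform[{}] {} -> device[{}] {}".format(p_idx, platform.name, d_idx, device.name))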

Don't take for granted that your AMD is better than your Intel in every case. I have had cases where the Intel surpasses the other one. I think this has to do with the cost of copying the data out of the CPU's memory to the discrete GPU.

Having said that, I think that running your script in parallel won't be as much of an improvement as using a better algorithm:

  • with a sieving algorithm, get the prime numbers up to the square root of your upper number.
  • applying a similar sieving algorithm, use the primes from the previous step to sieve the numbers between your lower and upper bounds.

Perhaps this is not a good example of an algorithm that can be easily run in parallel, but you are all set up to try another example.

However, to show you how you can solve this problem using your GPU, consider the following changes:

The serial algorithm would be something like the following:

from math import sqrt

def primes_below(number):
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            if num > limit: break  # no need to sieve with factors above sqrt(number)
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag: yield n(i)

def primes_between(lo, hi):
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag: yield n(i)

This prints a list of the primes between 1e6 and 5e6 in 0.64 seconds.

Trying your original script on my GPU, it didn't finish in more than 5 minutes. For a problem 10 times smaller, the primes between 1e5 and 5e5, it took roughly 29 seconds.

Modifying the script so that each GPU work item tests one odd number (there's no point testing even numbers) against a list of pre-computed primes up to the square root of the upper number, stopping once the prime is greater than the square root of the number itself, it completes the same task in 0.50 seconds. That's an improvement!

The code is the following:

from math import sqrt  # used below for the sqrt(hi) limit

import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):

    primes = list(primes_below(int(sqrt(hi))+1))

    # use hi + 1 as the stop so an odd `hi` is included and the array
    # length always matches `size` below
    numbers_h = np.arange(lo + (0 if lo & 1 else 1),
                          hi + 1,
                          2,
                          dtype=np.int32)

    size = (hi - lo - (0 if hi%2 else 1))//2 + 1

    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                   __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
   
        if (*primes <= max && num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """

    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
       properties=[(cl.context_properties.PLATFORM, platforms[0])])     
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)

    # the trailing 0 terminates the kernel's `for (; *primes; ++primes)` loop
    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:] + [0],  # don't need 2
                                           dtype=np.int32))

    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)

    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]

    yield from array.get()[:length.get()]
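
Both functions are generators (and primes_between_using_cl also relies on primes_below from the serial code above being defined in the same file), so you have to consume them, e.g. with list() or a loop, to actually see the numbers. For example, with the range from your question:

# assumes primes_below, primes_between and primes_between_using_cl
# are all defined in the same script
print(list(primes_between(900, 1000)))
# [907, 911, 919, 929, 937, 941, 947, 953, 967, 971, 977, 983, 991, 997]

for p in primes_between_using_cl(900, 1000):
    print(p, end=' ')
print()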
chapelo
  • Check the edit I made, I posted an example at the end. Try to run it. – chapelo Jan 21 '21 at 18:30
  • your code looks promising but both scripts return nothing https://snipboard.io/M7N1Dy.jpg . Maybe because there isn't actually a `print()` in them? Also, how can I set `hi` and `lo`? – Francesco Mantovani Jan 21 '21 at 22:05
  • The functions I gave you are generators. You have to use them in your program with the input values that are relevant in each case, and the output will be used for something useful, I guess. Just printing the results does very little. Call the function from your program with the values of hi and lo (like your lower and upper values) and use the results one by one, or put them in a list. I thought you'd know these simple things. – chapelo Jan 21 '21 at 22:42
  • my knowledge is definitely lower than yours; for instance, it's the first time I see a `yield` and I didn't know it existed before. I just want to print the results on the terminal. If I substitute `yield from` with `return` it says `NameError: name 'primes_below' is not defined`. How can I just print the numbers from 1 to 1000? I just want to check if your solution works. Thank you – Francesco Mantovani Jan 22 '21 at 12:57
  • I will post my response as another answer, to keep it clear and to the point. – chapelo Jan 22 '21 at 16:57

The following code is a sample of a complete Python program, which usually includes:

  • The import statements
  • The function definitions
  • The main() function
  • The if __name__ == "__main__": section.

I hope this helps you solve your problem.

import pyprimes
from math import sqrt
import numpy as np

import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_below(number):
    """Generate a list of prime numbers below a specified  `number`"""
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            if num > limit: break
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    """Generate a list of prime numbers betwenn `lo` and `hi` numbers"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue # avoid dividing by 2
        nlo = n(0)
        # slower # start = prime * (nlo//prime + 1) if nlo%prime else 0
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)

def primes_between_using_cl(lo, hi):
    """Generate a list of prime numbers betwenn a lo and hi numbers
    this is a parallel algorithm using pyopencl"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size_primes_h = np.array( (len(primes)-1, ), dtype=np.int32)
    numbers_h = np.arange(lo + (0 if lo&1 else 1),
                          hi + 1,  # hi + 1 so an odd `hi` is included and the length matches `size`
                          2,
                          dtype=np.int32)
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                        __global         int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
         if (*primes > max) break;
         if (num % *primes == 0) {
            numbers[gid] = 0;
            return;
         }
      }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
        properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    # the trailing 0 terminates the kernel's loop over the primes
    primes_d = cl.array.to_device(queue, np.array(primes[1:] + [0], dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]

def test(f, lo, hi):
    """Test that all prime numbers are generated by comparing with the
    output of the library `pyprimes`"""
    a = filter(lambda p: p>lo, pyprimes.primes_below(hi))
    b = f(lo, hi)
    result = True
    for p, q in zip (a, b):
        if p != q:
            print(p, q)
            result = False
    return result
    
def main():
    lower = 1000
    upper = 5000
    print("The prime numbers between {} and {}, are:".format(lower,upper))
    print()
    for p in primes_between_using_cl(lower, upper):
        print(p, end=' ')
    print()

if __name__ == '__main__':
    main()
chapelo
  • Thank you so much @chapelo. However I still have tons of doubts: I can set `platforms[0]` or `platforms[1]`, but the script always uses the Radeon and never the Intel. Also, the script only uses about 10% of the GPU, not more than that: https://snipboard.io/sNqzWk.jpg – Francesco Mantovani Jan 22 '21 at 18:33