Solving Linear Equations on the GPU with NumPy and PyTorch

Question

I am trying to solve a lot of linear equations as fast as possible. To find out the fastest way I benchmarked NumPy and PyTorch, each on the CPU and on my GeForce 1080 GPU (using Numba for NumPy). The results really confused me.

This is the code I used with Python 3.8:

import timeit

import torch
import numpy
from numba import njit


def solve_numpy_cpu(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)

    for _ in range(1000):
        numpy.linalg.solve(a, b)


def solve_numpy_njit_a(dim: int = 5):
    njit(solve_numpy_cpu, dim=dim)


@njit
def solve_numpy_njit_b(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)

    for _ in range(1000):
        numpy.linalg.solve(a, b)


def solve_torch_cpu(dim: int = 5):
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)

    for _ in range(1000):
        torch.solve(b, a)


def solve_torch_gpu(dim: int = 5):
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    solve_torch_cpu(dim=dim)


def main():
    for f in (solve_numpy_cpu, solve_torch_cpu, solve_torch_gpu, solve_numpy_njit_a, solve_numpy_njit_b):
        time = timeit.timeit(f, number=1)
        print(f"{f.__name__:<20s}: {time:f}")


if __name__ == "__main__":
    main()

And these are the results:

solve_numpy_cpu     : 0.007275
solve_torch_cpu     : 0.012244
solve_torch_gpu     : 5.239126
solve_numpy_njit_a  : 0.000158
solve_numpy_njit_b  : 1.273660

The slowest is CUDA accelerated PyTorch. I verified that PyTorch is using my GPU with

import torch
torch.cuda.is_available()
torch.cuda.get_device_name(0)

returning

True
'GeForce GTX 1080'

I can get behind that, on the CPU, PyTorch is slower than NumPy. What I cannot understand is why PyTorch on the GPU is so much slower. Not that important but actually even more confusing is that Numba's njit decorator makes performance orders of magnitude slower, until you don't use the @ decorator syntax anymore.

Is it my setup? Occasionally I get a weird message about the windows page / swap file not being big enough. In case I've taken a completely obscure path to solving linear equations on the GPU, I'd be happy to be directed into another direction.

Edit

So, I focussed on Numba and changed my benchmarking a bit. As suggested by @max9111 I rewrote the functions to receive input and produce output because, in the end, that's what anyone would want to use them for. Now, I also perform a first compile run for the Numba accelerated function so the subsequent timing is fairer. Finally, I checked the performance against matrix size and plotted the results.

TL/DR: Up to matrix sizes of 500x500, Numba acceleration doesn't really make a difference for numpy.linalg.solve.

Here is the code:

import time
from typing import Tuple

import numpy
from matplotlib import pyplot
from numba import jit


@jit(nopython=True)
def solve_numpy_njit(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters


def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters


def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b


def main():
    a, b = get_data(10)
    # compile numba function
    p = solve_numpy_njit(a, b)

    matrix_size = [(x + 1) * 10 for x in range(50)]
    non_accelerated = []
    accelerated = []
    results = non_accelerated, accelerated

    for j, each_matrix_size in enumerate(matrix_size):
        for m, f in enumerate((solve_numpy, solve_numpy_njit)):
            average_time = -1.
            for k in range(5):
                time_start = time.time()
                for i in range(100):
                    a, b = get_data(each_matrix_size)
                    p = f(a, b)
                d_t = time.time() - time_start
                print(f"{each_matrix_size:d} {f.__name__:<30s}: {d_t:f}")
                average_time = (average_time * k + d_t) / (k + 1)
            results[m].append(average_time)

    pyplot.plot(matrix_size, non_accelerated, label="not numba")
    pyplot.plot(matrix_size, accelerated, label="numba")
    pyplot.legend()
    pyplot.show()


if __name__ == "__main__":
    main()

And these are the results (runtime against matrix edge length):

Edit 2

Seeing that Numba doesn't make much of a difference in my case, I came back to benchmarking PyTorch. And indeed, it appears to be roughly 4x faster than Numpy without even using a CUDA device.

Here is the code I used:

import time
from typing import Tuple

import numpy
import torch
from matplotlib import pyplot


def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters


def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b


def get_data_torch(dim: int) -> Tuple[torch.tensor, torch.tensor]:
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    return a, b


def solve_torch(a: torch.tensor, b: torch.tensor) -> torch.tensor:
    parameters, _ = torch.solve(b, a)
    return parameters


def experiment_numpy(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data(matrix_size)
        p = solve_numpy(a, b)


def experiment_pytorch(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data_torch(matrix_size)
        p = solve_torch(a, b)


def main():
    matrix_size = [x for x in range(5, 505, 5)]
    experiments = experiment_numpy, experiment_pytorch
    results = tuple([] for _ in experiments)

    for i, each_experiment in enumerate(experiments):
        for j, each_matrix_size in enumerate(matrix_size):
            time_start = time.time()
            each_experiment(each_matrix_size, repetitions=100)
            d_t = time.time() - time_start
            print(f"{each_matrix_size:d} {each_experiment.__name__:<30s}: {d_t:f}")
            results[i].append(d_t)

    for each_experiment, each_result in zip(experiments, results):
        pyplot.plot(matrix_size, each_result, label=each_experiment.__name__)

    pyplot.legend()
    pyplot.show()


if __name__ == "__main__":
    main()

And here's the result (runtime against matrix edge length):

So for now, I'll be sticking with torch.solve. However, the original question remains:

How can I exploit my GPU to solve linear equations even faster?

Make use of benchmarks with outputs. In compilers like Numba, there is usually a dead code elimination -> If this works, you are measuring only the function call, but no calculation (njit_a) Also avoid defaults, this can lead to repeated recompilation, which is slow (njit_b) GPUs are usually much slower if you calculate such small problems (not parallelizeable, slow data transfer, maybe also slow function call) — max9111, May 30 '20 at 10:20
@max9111 thanks for your comment! i'll look into using outputs and remove the defaults. concerning the parallelizeablility of solving linear equations: i thought that the underlying gauss elimination was parallelizeable. that's why i hoped to benefit from gpu acceleration. am i in the wrong here? — wehnsdaefflae, May 30 '20 at 11:55
What is your real world use case? Solving a lot of tiny equation systems -> calculate them in paralle on cpu (nb.parfor). Solving larger equation sytems, which are parallelizeable on their own-> CPU or GPU. But you should also consider, that the GPU is much faster on single precision calculation. — max9111, May 30 '20 at 18:26
@max9111 it's an online regression algorithm that updates its parameters with each training example. so each system of equations is solved sequentially. the solving itself, however, should be in parallel which, as i understand it, should be possible using the jacobi method. — wehnsdaefflae, May 30 '20 at 18:58
But not for a 5x5 matrix. It must be much larger that parallelization is beneficial. Numpy with MKL or Numba, which calls the same solving algorithm would automitically parallelize larger problems. — max9111, May 30 '20 at 20:18
@max9111 i tried it with a 30x30 matrix and still, no difference. can you give a rough estimate concerning the order of magnitude where parallelization becomes beneficial? 100x100 or rather 1000x1000 and above? — wehnsdaefflae, May 30 '20 at 20:24
In my numpy-MKL it switches to parallel mode at 150 and above. On very tiny iterative algorithms other things matters more (like avoiding memory allocations,..)-> quite low level for Python BTW: Is solving such tiny equation-systems really the limiting part? It takes 8µs per call on my system... — max9111, May 30 '20 at 21:06
I guess running on GPU has an additional overhead of transferring data to the GPU memory that you are not taking into account. — sisaman, Apr 25 '22 at 07:14
+1 for everyone who said that GPU is not always faster than CPU. You need to set the problem up in a specific way for it to be GPU-friendly. And even then it may be that the overhead is too much. — Eljas Hyyrynen, Mar 21 '23 at 08:22

iacob · Accepted Answer · 2023-07-15T18:12:30.867

Your analysis is correct on several fronts, but there are a couple of nuances that might help clarify your results and improve GPU performance:

1. CPU vs GPU Performance In general, GPU operations have an overhead cost associated with transferring data between the CPU and GPU memory. Therefore, the benefits of GPU acceleration often become apparent with larger data sets, where the benefits of parallelization outweigh this overhead. This overhead cost is likely why the GPU computations are slower for small matrices. To exploit your GPU for solving linear equations, you should focus on larger matrices.

2. Torch solve vs torch.linalg.solve The torch.solve function has been deprecated since PyTorch 1.7.0. You might get better performance and more accurate results with torch.linalg.solve.

3. Numba's njit Performance Numba's @njit decorator accelerates Python functions by generating optimized machine code using the LLVM compiler infrastructure at import time. When you use the @njit decorator, Numba compiles the function in no-Python mode which may lead to slower performance if the function cannot be fully optimized. The first run will also include a "compilation" step, which can make it appear much slower if it is included in the timing.

4. Using CUDA Memory Efficiently The line torch.set_default_tensor_type("torch.cuda.FloatTensor") in your solve_torch_gpu function sets the default tensor type to CUDA tensors. Every tensor created afterward will be a CUDA tensor. This might lead to unnecessary usage of GPU memory and slow down the calculations. If you create your tensors directly on GPU when you need them (using .to(device) where device is your CUDA device), it will be more efficient and might improve your computation time.

Here's a revised version of your function that uses torch.linalg.solve and directly creates tensors on the GPU:

def solve_torch_gpu_optimized(dim: int = 5):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    a = torch.rand(dim, dim, device=device)
    b = torch.rand(dim, 1, device=device)

    for _ in range(1000):
        torch.linalg.solve(a, b)

Solving Linear Equations on the GPU with NumPy and PyTorch

Edit

Edit 2

1 Answers1