
Multiplying large matrices takes a very long time. How can this be solved? I use the `galois` library together with NumPy, so I would expect it to work reliably. I also tried implementing my own GF(4) arithmetic and multiplying the matrices with plain NumPy, but that takes even longer. Thank you for your reply.

For r = 2, 3, 4, 5, 6 the multiplication is fast; beyond that it takes a long time. To me these are not very large matrices. Below is just a code snippet: given r, I compute the sizes n, k of a certain family of codes, and then I need to multiply matrices with those dimensions.

import numpy as np
import galois


def family_Hamming(q, r):
    # Parameters (n, k) of the q-ary Hamming code family
    n = (q**r - 1) // (q - 1)
    k = n - r
    return (n, k)

q = 4
r = 7

n, k = family_Hamming(q, r)

GF = galois.GF(2**2)

# a has shape (k, k) = (5454, 5454)
a = GF(np.random.randint(4, size=(k, k)))
# b has shape (k, n) = (5454, 5461)
b = GF(np.random.randint(4, size=(k, n)))
c = np.dot(a, b)
print(c)
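
For context, here is a minimal sketch of the kind of hand-rolled GF(4) arithmetic mentioned above (an assumed lookup-table approach, not necessarily the exact code that was tried): addition in GF(2^m) is bitwise XOR, and multiplication goes through a 4x4 table. The per-row table lookups and XOR reductions build large temporaries in NumPy, which is part of why this tends to be even slower than `galois` at these sizes.

import numpy as np

# GF(4) with primitive polynomial x^2 + x + 1, elements encoded as 0..3.
# Addition in GF(2^m) is bitwise XOR; multiplication uses a lookup table.
MUL_TABLE = np.array([[0, 0, 0, 0],
                      [0, 1, 2, 3],
                      [0, 2, 3, 1],
                      [0, 3, 1, 2]], dtype=np.uint8)

def gf4_matmul(A, B):
    # C[i, j] = XOR-sum over t of A[i, t] * B[t, j] in GF(4)
    C = np.empty((A.shape[0], B.shape[1]), dtype=np.uint8)
    for i in range(A.shape[0]):
        # (K, N) table of products of row i of A with all of B, XOR-reduced
        C[i] = np.bitwise_xor.reduce(MUL_TABLE[A[i][:, None], B], axis=0)
    return C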
  • What kind of speed are you expecting? Multiplying two 5000x5000 matrices together is a pretty heavy operation – Dario Petrillo Nov 30 '22 at 09:05
  • At least 20 seconds would be fine, but not more than a minute, as it takes now. – Дима Бобрикович Nov 30 '22 at 09:08
  • Is using `galois` any faster than using pure NumPy? If so, it might be as good as it gets, if not, you should probably file a bug for `galois`, as according to them it should be *"faster than NumPy"*. – norok2 Nov 30 '22 at 09:25
  • Perhaps you can speed this up with CUDA? Though you will need access to a GPU; or you can use the free one on Google Colab for your calculation. Pretty sure you can wrap your matrices in jax/numba/torch etc and just do the matmul faster on cuda. – Mercury Nov 30 '22 at 09:29
  • It turns out the problem is in `galois`; it just keeps running. I tried to write my own, but it also keeps running. – Дима Бобрикович Nov 30 '22 at 09:43
  • I'm the author of `galois`. The algorithm is the O(N^3) one, however it is JIT compiled -- so it should be much faster than in pure Python. Using CUDA won't work until CUDA kernels are created for the finite field arithmetic (in a future release of `galois`). I'm tracking this issue in https://github.com/mhostetter/galois/issues/439. I have some ideas for parallelizing some of the computations with Numba. Stay tuned for a 0.2.1 release with performance improvements. – Matt Hostetter Nov 30 '22 at 14:43

3 Answers


I'm not sure if it is actually faster, but np.dot is meant for the dot product of two vectors; for matrix multiplication, use A @ B. As far as I know, that's as efficient as you can get in Python.
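
For example, a small sketch with the arrays from the question (sizes shrunk so it runs quickly):

import numpy as np
import galois

GF = galois.GF(2**2)

a = GF(np.random.randint(4, size=(500, 500)))
b = GF(np.random.randint(4, size=(500, 600)))

c = a @ b  # matrix multiplication with the @ operator instead of np.dot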

yagod

I'm the author of `galois`. I added performance improvements to matrix multiplication in v0.3.0 by parallelizing the arithmetic over multiple cores. The next performance improvement will come once GPU support is added.

I'm open to other performance improvement suggestions, but as far as I know the algorithm is running as fast as possible on a CPU.

In [1]: import galois

In [2]: GF = galois.GF(2**2)

In [3]: A = GF.Random((300, 400), seed=1)

In [4]: B = GF.Random((400, 500), seed=2)

# v0.2.0
In [5]: %timeit A @ B
1.02 s ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# v0.3.0
In [5]: %timeit A @ B
99 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
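
If you're on an older release, upgrading (assuming the standard PyPI package name) picks up the parallelized matrix multiplication:

pip install --upgrade galois
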
Matt Hostetter

Try using JAX on a CUDA runtime. For example, you can try it out on Google Colab's free GPU (open a notebook -> Runtime -> Change runtime type -> GPU).

import numpy as np
import galois
import jax.numpy as jnp
from jax import device_put

GF = galois.GF(2**2)
k, n = 5454, 5461  # sizes from the question

a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))

# move the data to the GPU (note: this drops the finite-field type)
a, b = device_put(a), device_put(b)
c = jnp.dot(a, b)

c = np.asarray(c)

Timing test:

%timeit jnp.dot(a, b).block_until_ready()
# 765 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Mercury
  • This will compute the matrix multiplication in Z, not the finite field. – Matt Hostetter Nov 30 '22 at 14:13
  • I see. I have no idea about finite fields or the galois library; I just wrote a quick solution for multiplying massive np matrices on CUDA. I did a quick lookup just now: the O(N^3) matmul is happening [here](https://github.com/mhostetter/galois/blob/master/src/galois/_domains/_linalg.py#L224), I presume? I wonder if that could be pulled out as an isolated CUDA-supported function as a temporary fix somehow. – Mercury Nov 30 '22 at 15:23