Numpy matrix multiplication but instead of multiplying it XOR's elements

Question

as stated in the title I want to make a calculation where instead of multiplying corresponding elements I binary XOR them and then add them. Example for illustration:

EDIT: The big picture above IS the calculation but here we go: Take first row from the left [1 0 1] and first column from top matrix [1 0 0]. 1 XOR 1 = 0, 0 XOR 0 = 0, 1 XOR 0 = 1. Add them all 0 + 0 + 1 = 1. First row from the left [1 0 1], second column [0 0 0]: 1 XOR 0 = 1, 0 XOR 0 = 0, 1 XOR 0 = 1. Add them all 1 + 0 + 1 = 2. And so on

Is it possible to do that in numpy?

Could you elaborate with at least one calculated example (one matrix element)? — sai, Apr 08 '21 at 14:56
Seems related (maybe duplicate): https://stackoverflow.com/questions/50009029/hamming-distance-in-numpy — Damien, Apr 08 '21 at 15:17

Kevin · Accepted Answer · 2021-04-08T15:41:33.977

4

Try this:

M1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])
M2 = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1]])
(M1 ^ M2[:,None]).sum(-1)

Output:

array([[1, 2, 2],
       [2, 1, 1],
       [2, 3, 3]])

EDIT

If you want to preallocate memory:

intermediary = np.empty((3,3,3), dtype=np.int32)
np.bitwise_xor(M1, M2[:,None], out=intermediary).sum(-1)

edited Apr 08 '21 at 15:41

answered Apr 08 '21 at 15:29

Kevin

3,096
2
8
37

Although solution looks very short, it may overflow memory for huge matrices, and OP said he has huge matrices, for example for two matrices of 1000x1000 size (1MB size) it produces intermediate matrix of size 1000x1000x1000 (1GB!!!). – Arty Apr 08 '21 at 15:35
1GB is not that huge by modern standards – Mad Physicist Apr 08 '21 at 15:36
You could specify the ```out``` keyword, that will prevent internal memory allocation. – Kevin Apr 08 '21 at 15:37
@MadPhysicist I just imagined 1000x1000 matrix, OP may have 10000x10000 (100MB) matrix, then intermediate matrix will be 10000x10000x10000 (1 TeraByte!!!). – Arty Apr 08 '21 at 15:40
@Kevin. How would that help if you don't have an N**3 sized array already allocated? – Mad Physicist Apr 08 '21 at 15:41
@MadPhysicist, that is true, you will need to allocate a N**3 somehow for this to work. – Kevin Apr 08 '21 at 15:42
Pre-allocation will marginally slow things down most likely, since you're adding a useless function call... – Mad Physicist Apr 08 '21 at 15:42

max9111 · Answer 2 · 2021-04-09T13:13:06.527

This is just a longer comment on Artys answer. There are a few things to speed up the Numba function.

Steps to improve performance

import numpy as np, numba

m1 = np.random.randint(low=0, high=1,size=1_000_000).reshape(1_000,1_000)
m2 = np.random.randint(low=0, high=1,size=1_000_000).reshape(1_000,1_000)

#@Arty
@numba.njit(cache = True)
def matxor_1(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            mr[i, j] = np.sum(m1[:, j] ^ m2[i, :])
    return mr

%timeit matxor_1(m1, m2)
#1.06 s ± 9.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Aligned memory acces (real transpose the ascontiguousarray is important)
@numba.njit(cache = True)
def matxor_2(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    m1_T=np.ascontiguousarray(m1.T)

    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            mr[i, j] = np.sum(m1_T[j, :] ^ m2[i, :])
    return mr

%timeit matxor_2(m1, m2)
#312 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Writing out the inner loop
@numba.njit(fastmath=True,cache = True)
def matxor_3(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    m1_T=np.ascontiguousarray(m1.T)

    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            acc=0
            for k in range(m2.shape[1]):
                acc+=m1_T[j, k] ^ m2[i, k]
            mr[i, j] = acc
    return mr

%timeit matxor_3(m1, m2)
#125 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Parallelization
@numba.njit(fastmath=True,cache = True,parallel=True)
def matxor_4(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    m1_T=np.ascontiguousarray(m1.T)

    for i in numba.prange(mr.shape[0]):
        for j in range(mr.shape[1]):
            acc=0
            for k in range(m2.shape[1]):
                acc+=m1_T[j, k] ^ m2[i, k]
            mr[i, j] = acc
    return mr

%timeit matxor_4(m1, m2)
#23.8 ms ± 711 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(np.allclose(matxor_1(m1, m2),matxor_2(m1, m2)))
#True
print(np.allclose(matxor_1(m1, m2),matxor_3(m1, m2)))
#True
print(np.allclose(matxor_1(m1, m2),matxor_4(m1, m2)))
#True

@Arty Maybe it would be good if you just integrate the improvement steps in your answer and I delete this answer afterwards. As said it was just a bit longer comment. — max9111, Apr 09 '21 at 13:15
Thanks! Integrated into my answer. If something is wrong and/or needs corrections then just edit my answer. — Arty, Apr 09 '21 at 13:29

Arty · Answer 3 · 2021-04-09T13:45:13.563

You can just make a combination of two loops and Numpy 1D xor-sum, like below:

Try it online!

import numpy as np
m1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])
m2 = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1]])
mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
for i in range(mr.shape[0]):
    for j in range(mr.shape[1]):
        mr[i, j] = np.sum(m1[:, j] ^ m2[i, :])
print(mr)

Output:

[[1 2 2]
 [2 1 1]
 [2 3 3]]

As @MadPhysicist suggested you can use Numba JIT-optimizer (pip install numba) to boost code above and you'll get very fast code for you operations with small memory consumption:

Try it online!

import numpy as np, numba
@numba.njit(cache = True)
def matxor(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            mr[i, j] = np.sum(m1[:, j] ^ m2[i, :])
    return mr
m1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])
m2 = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1]])
print(matxor(m1, m2))

Also Numba code above can be boosted up to 44x times more thanks to following great improvements suggested and coded by @max9111:

import numpy as np, numba

m1 = np.random.randint(low=0, high=1,size=1_000_000).reshape(1_000,1_000)
m2 = np.random.randint(low=0, high=1,size=1_000_000).reshape(1_000,1_000)

#@Arty
@numba.njit(cache = True)
def matxor_1(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            mr[i, j] = np.sum(m1[:, j] ^ m2[i, :])
    return mr

%timeit matxor_1(m1, m2)
#1.06 s ± 9.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Aligned memory acces (real transpose the ascontiguousarray is important)
@numba.njit(cache = True)
def matxor_2(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    m1_T=np.ascontiguousarray(m1.T)

    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            mr[i, j] = np.sum(m1_T[j, :] ^ m2[i, :])
    return mr

%timeit matxor_2(m1, m2)
#312 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Writing out the inner loop
@numba.njit(fastmath=True,cache = True)
def matxor_3(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    m1_T=np.ascontiguousarray(m1.T)

    for i in range(mr.shape[0]):
        for j in range(mr.shape[1]):
            acc=0
            for k in range(m2.shape[1]):
                acc+=m1_T[j, k] ^ m2[i, k]
            mr[i, j] = acc
    return mr

%timeit matxor_3(m1, m2)
#125 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Parallelization
@numba.njit(fastmath=True,cache = True,parallel=True)
def matxor_4(m1, m2):
    mr = np.empty((m2.shape[0], m1.shape[1]), dtype = np.int64)
    m1_T=np.ascontiguousarray(m1.T)

    for i in numba.prange(mr.shape[0]):
        for j in range(mr.shape[1]):
            acc=0
            for k in range(m2.shape[1]):
                acc+=m1_T[j, k] ^ m2[i, k]
            mr[i, j] = acc
    return mr

%timeit matxor_4(m1, m2)
#23.8 ms ± 711 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(np.allclose(matxor_1(m1, m2),matxor_2(m1, m2)))
#True
print(np.allclose(matxor_1(m1, m2),matxor_3(m1, m2)))
#True
print(np.allclose(matxor_1(m1, m2),matxor_4(m1, m2)))
#True

@Kyatt Depends how huge, what is theirs maximal size? This code above is quite close to optimal, because most of time is spent inside xor-sum, not inside loops, so most of time is inside Numpy, and you can't make it much faster. — Arty, Apr 08 '21 at 15:10
@Kyatt For example for matrices sizes of 1000 x 1000 is quite alright and will be computed in a fraction of second. Just try on your data. — Arty, Apr 08 '21 at 15:12
@Kyatt I meant that 1000x1000 matrix is not considered large and will be computed fast. In my code you can use any non-square sizes, e.g. 500x700. — Arty, Apr 08 '21 at 15:15
@Kyatt Also in my code I used `np.int64`, maybe `np.uint32` or `np.uint16` is enough for your task, I don't know. If you use `np.uint32` instead of `np.int64` you'll have twice less memory usage. — Arty, Apr 08 '21 at 15:20
I most of the time is spent in the overhead of numpy because you keep calling the functions over and over on the python side. It's not spent doing the actual calculation. — Mad Physicist, Apr 08 '21 at 15:35
Compile this with numba njit and you have a winner in speed and memory consumption — Mad Physicist, Apr 08 '21 at 15:37
@MadPhysicist If matrices are huge like OP said, e.g. 1000x1000 then for each loop iteration we spend around 2000 operations within Numpy code. Even if one function call one one loop iteration are expensive, still they are not that expensive as 2000 C-language numpy operations. So at most 10-20% of time we spend in python and around 80-90% of time in Numpy, it is not that bad! — Arty, Apr 08 '21 at 15:38
@MadPhysicist Added to my answer your suggestion regarding Numba! Thanks! — Arty, Apr 08 '21 at 15:45
@Kyatt Please put a look at my answer, just updated it, added Numba-optimized code. It will produce fastest possible code for your operation, you can't get faster than that! — Arty, Apr 08 '21 at 15:47
I wrote a bit longer comment on your Numba version https://stackoverflow.com/a/67017737/4045774, since there are quite a few things to improve. The speedup is about 44x. on 1000x1000 arrays. — max9111, Apr 09 '21 at 08:37

score 0 · Answer 4 · answered Aug 25 '22 at 23:20

I was trying to find something that is simpler for me to understand and runs faster. So, I came up with this.

To try and avoid loops, we can convert this problem into a matrix multiplication problem. Matrix multiplications are highly optimized in libraries like numpy(I guess).

XOR operation followed by summing calculates number of dissimilar bits between two bit vectors. We could count this value indirectly by using bipolar vectors, i.e. converting the 1,0 bit vectors to 1,-1 vectors.

Steps:

The matrix can be seen as a stack of bit vectors of length m. You can assume these bit vectors to be the rows of M1 and the columns of M2.
Obtain the product matrix M1 x M2.

Recover the number of dissimilar bits from the values in the product matrix using the following formula.

Each bit vector has m bits. m = length of bit vector
When comparing two bit vectors,
let k = number of similar bits
=> m-k = number of dissimilar bits
let r = intermediate result = result obtained by taking dot product of two 1,-1 vectors
clearly,
r = 1*k + (-1) (m-k)
r = k -m + k
r = 2k -m  --(1)
OR
k = (m+r)/2 --(2) = sum(XNOR)

Given bit vectors, we know m. We can compute r as descibed above.
We can compute k using (2)

But we need the number of dissimilar bits (m-k)
m-k = m - ((m+r)/2) = (2m-m-r)/2 = (m-r)/2
number of dissimilar bits = (m-k) = (m-r)/2 = sum(XOR)

Doing the same at the matrix level, we get:

Actual code:

import numpy as np

def matXOR_matXNOR(M1,M2):

    N1 = 2*M1 - 1; N2 = 2*M2 - 1 #convert to bipolar vectors
    m1, n = N1.shape #N2 is n x m2 matrix
    intr = np.matmul(N1,N2) #m1 x m2 matrix -> intermediate result

    xorMat = (m-intr)//2
    xnorMat = (m+intr)//2

    return xorMat, xnorMat

Numpy matrix multiplication but instead of multiplying it XOR's elements

4 Answers4

Linked