1

I have a large matrix (like 100.000 x 100.000). The good thing it contains only zeros and ones and mostly zeros (it is already saved as boolean matrix to save some RAM). Now I need to multiply each column of the matrix with all of the other columns. The reason is that I need need to check whether there is as least one row where both columns have a non-zero element (therefore multiplying and summing the resulting vector to check whether it is zero or not). As an example assume we have a matrix

1.column 2.column 3.column
1 0 0
1 1 0
0 0 1

Then I need to compare all columns and check whether there is as least one row where both columns are one. So comparing the first and the second column would return a True since they are both one in the second row. However comparing the first and third column and the second the third column would result in a Falsesince there are no rows with a row where both are one. Obviously this can be done using a for loop and iterating over all columns. However not in a very satisfying speed. I already tried numba like this:

@njit(parallel=True)
def create_dist_arr(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n)) #, dtype=bool)
    for i in prange(arr.shape[1]):
        for j in prange(i, arr.shape[1]):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            dist_arr[i][j] = dist_greater_zero
            dist_arr[i][j] = dist_greater_zero
    return skill_dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    return np.sum(np.multiply(ith_col, jth_col)) != 0

zero_arr = np.zeros(shape=(2000, 6000), dtype=bool)
bool_dist_matrix = create_dist_arr(zero_arr)

But despite having 120gb Ram and 32 cores, that gets very slow around 10.000 x 10.000 matrices. Even worse is it when trying scipy.spatial.distance.pdist like this:

from scipy.spatial.distance import pdist
zero_arr = np.zeros(shape=(500, 500), dtype=bool)
bool_dist_matrix = pdist(zero_arr, lambda u, v: np.sum(np.multiply(u, v)) != 0)

Is there maybe some nice and fast workaround using sparse matrices or something else not taking like forever?

Thank you in advance :)

LaLeLo
  • 137
  • 1
  • 9
  • 2
    **The reason is that I need need to check whether there is as least one row where both columns have a non-zero element** no matter how many times I read that I dont understand what you mean? – DrBwts Jun 16 '21 at 08:10
  • Let me try an example: Assume a matrix A= [1 0 0] – LaLeLo Jun 16 '21 at 08:15
  • @LaLeTo amend the OP rather than putting it in the comments – DrBwts Jun 16 '21 at 08:16
  • 1
    yeah give me sec, i am too stupid for stackoverflow – LaLeLo Jun 16 '21 at 08:18
  • Is it clearer now what I try to do? – LaLeLo Jun 16 '21 at 08:27
  • I would have gone down the `itertools.combinations` route & and then a logical `and` between columns, then `if result > 0` you get `True` else you get `False` Then you avoid the matrix multiplication completely. Alterativley check out [Sparse Matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html) – DrBwts Jun 16 '21 at 08:49
  • 1
    How dense is the matrix (proportion of non zero entries?) If the matrix is quite sparse it is important to make use of this (this can be orders of magnitude faster with lower memory usage). After you have created a sparse matrix (csc or csr depending on the operation) you can work with this sparse data. This an example on the maximum value of a sparse matrix and a vector to give an idea ho this can be implemented: https://stackoverflow.com/a/64920528/4045774 – max9111 Jun 16 '21 at 12:33

3 Answers3

3

Idk how memory efficient this solution is nor if it will be faster than the other solutions but it is vectorized.

Your idea was to multiply the columns and add the result. which reminded me of matrix multiplication, except its against its own columns. so..

Lets say your matrix is M. If you take the transverse M.T and matrix multiply against itself, M.T @ M it will be the same as multiplying each column against the other columns and taking the sum.

import numpy as np

# A random matrix
M = np.array([[1,0,0,0,0,1],
              [1,1,0,0,0,1],
              [0,1,1,1,0,0]])

bool_dist_matrix = (M.T @ M).astype('bool')
np.fill_diagonal(bool_dist_matrix, 0)

"""
[[0 1 0 0 0 1]
 [1 0 1 1 0 1]
 [0 1 0 1 0 0]
 [0 1 1 0 0 0]
 [0 0 0 0 0 0]
 [1 1 0 0 0 0]]
"""
Ta946
  • 1,342
  • 12
  • 19
  • 2
    This solution could be memory efficient and fast if you use sparse matrices for M. Instead of `np.fill_diagonal(bool_dist_matrix, 0)` you could use `bool_dist_matrix.a.setdiag(0)` then. – max9111 Jun 16 '21 at 16:00
1

If your matrix is mainly composed by zeros, then it will be more efficient to work with index:

import numpy as np

# A random matrix
M = np.array([[1,0,0,0,0,1],
              [1,1,0,0,0,1],
              [0,1,1,1,0,0]])

# Get the index where M == 1
ind = np.where(M)
# Get the unique value and return the count.
uni = np.unique(ind[0], return_counts=True)
# Keep only the column that have at least two 1 in the same row.
col = uni[0][uni[1]>1]
# Create a boolean index.
bol = np.isin(ind[0],col)

# And here we get the "compressed" result:
res = ind[1][bol] #Col number
grp = ind[0][bol] #Group
# res = [0, 5, 0, 1, 5, 1, 2, 3]
# grp = [1, 1, 2, 2, 2, 3, 3, 3]
# So each pair of each group will output a True statement:
# group 1: 0-5
# group 2: 0-1, 0-5, 1-5
# group 3: 1-2, 1-3, 2-3
# For an explicit result you could use itertools. But all the information is 
# contained in the res variable.

Noticed that this method will produce some duplicate if, for multiple rows, two column have a common 1 value. But it will be easy to get rid of those duplicate. But since you're working with a 100000x100000 matrix and that not all columns have at least one common 1 value, it is very likely that the percentage of 1 in your matrix is very low, so this method should provide some good result.

obchardon
  • 10,614
  • 1
  • 17
  • 33
0

I'm not sure what you meant you need to do, but if i understood correctly my code should help. Here's what i tried. It seems to run a little faster. Although that won't be the case if the matrix isn't sparse as you mentioned and i get a symmetric matrix and not an upper triangular one:

import numpy as np
from numba import njit, prange
@njit(parallel=True)
def create_dist_arr(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n)) #, dtype=bool)
    for i in prange(arr.shape[1]):
        for j in prange(i, arr.shape[1]):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            dist_arr[i][j] = dist_greater_zero
            dist_arr[i][j] = dist_greater_zero
    return dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    return np.sum(np.multiply(ith_col, jth_col)) != 0

def create_dist_arr_sparse(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n)) #, dtype=bool)
    rows, cols = np.array(np.where(arr))
    same_rows = rows.reshape(1, -1) == rows.reshape(-1, 1)
    idx0, idx1 = np.meshgrid(cols, cols)
    idx0 = idx0.flatten()[same_rows.flatten()]
    idx1 = idx1.flatten()[same_rows.flatten()]
    dist_arr[idx0, idx1] = 1
    return dist_arr

np.random.seed(1)
k = 1000
zero_arr = np.zeros(shape=(k, k), dtype=bool)
rows, cols = np.random.randint(0, k, (2, k))
zero_arr[rows, cols] = 1
%timeit bool_dist_matrix = create_dist_arr(zero_arr)
%timeit bool_dist_matrix = create_dist_arr_sparse(zero_arr)

output:

1 loop, best of 5: 1.59 s per loop
100 loops, best of 5: 9.8 ms per loop
yann ziselman
  • 1,952
  • 5
  • 21