I have a large matrix (around 100,000 x 100,000). The good thing is that it contains only zeros and ones, mostly zeros (it is already stored as a boolean matrix to save some RAM). Now I need to multiply each column of the matrix with every other column. The reason is that I need to check whether there is at least one row where both columns have a non-zero element (hence multiplying the two columns and summing the resulting vector to check whether the sum is zero or not). As an example, assume we have the matrix
1.column | 2.column | 3.column |
---|---|---|
1 | 0 | 0 |
1 | 1 | 0 |
0 | 0 | 1 |
Then I need to compare all pairs of columns and check whether there is at least one row where both columns are one. So comparing the first and the second column would return True, since they are both one in the second row. However, comparing the first and the third column, or the second and the third column, would return False, since there is no row where both are one. Obviously this can be done with a for loop iterating over all columns, but not at a satisfying speed. I already tried numba like this:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def create_dist_arr(arr: np.ndarray):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  #, dtype=bool)
    for i in prange(n):
        # numba parallelizes only the outermost prange; this one runs as a plain range
        for j in prange(i, n):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            # the result is symmetric, so fill both (i, j) and (j, i)
            dist_arr[i][j] = dist_greater_zero
            dist_arr[j][i] = dist_greater_zero
    return dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    # True if the two columns share at least one row where both are non-zero
    return np.sum(np.multiply(ith_col, jth_col)) != 0

zero_arr = np.zeros(shape=(2000, 6000), dtype=bool)
bool_dist_matrix = create_dist_arr(zero_arr)
But despite having 120 GB of RAM and 32 cores, the numba version gets very slow around 10,000 x 10,000 matrices. It is even worse when trying scipy.spatial.distance.pdist like this:
import numpy as np
from scipy.spatial.distance import pdist
zero_arr = np.zeros(shape=(500, 500), dtype=bool)
# pdist compares rows pairwise, so transpose to compare columns
bool_dist_matrix = pdist(zero_arr.T, lambda u, v: np.sum(np.multiply(u, v)) != 0)
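To pin down the expected output: as far as I can tell, both attempts above should give the same result as thresholding the product of the transposed matrix with itself, since entry (i, j) of that product counts the rows where columns i and j are both one. Below is a minimal dense reference on the small example from the table. The name `columns_overlap_dense` is only made up for illustration, and the integer cast makes this far too memory-hungry for the real matrix.

```python
import numpy as np


def columns_overlap_dense(arr):
    """Reference check: True at (i, j) when columns i and j share a row of ones."""
    # Cast to integers so the matrix product counts shared rows per column pair
    a = np.asarray(arr).astype(np.int64)
    counts = a.T @ a
    return counts > 0


# The 3 x 3 example from the table above
example = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 0, 1]], dtype=bool)
expected = columns_overlap_dense(example)
print(expected)
```

For the example, this marks columns 1 and 2 as overlapping and column 3 as overlapping only with itself, which matches the comparison described above.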
Is there maybe a nice and fast workaround, using sparse matrices or something else, that does not take forever?
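To make the sparse idea concrete, this is roughly what I have in mind: store the matrix with scipy.sparse and take the sparsity pattern of the product of its transpose with itself. This is only a sketch under my own assumptions. The function name `columns_overlap_sparse` is hypothetical, and I have not measured how it behaves at 100,000 columns, where the result could become dense if many columns overlap.

```python
import numpy as np
from scipy import sparse


def columns_overlap_sparse(arr):
    """Sparse boolean matrix, True at (i, j) when columns i and j share a row of ones."""
    # Integer values so the product counts shared rows (all counts are >= 0,
    # so nothing can cancel out to zero)
    s = sparse.csr_matrix(np.asarray(arr).astype(np.int32))
    gram = s.T @ s  # sparse (n_cols x n_cols) matrix of shared-row counts
    return gram.astype(bool)


# Same 3 x 3 example as in the table above
example = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 0, 1]], dtype=bool)
overlap = columns_overlap_sparse(example)
print(overlap.toarray())
```

The result stays sparse, so only the column pairs that actually overlap are stored; `toarray()` is used here just to display the small example.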
Thank you in advance :)