Scipy: Sparse indicator matrix from array(s)

Question

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:

I = a[:,None]==b

The following is slow and still memory-inefficient during creation:

I = csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:

z = np.argwhere((a[:,None]==b))

Any ideas?

I guess I could go the whole way: argsort a, detect the indices where the sorted array changes, compute the index combinations for each partition found this way, and unsort again... But I hope there is a simple numpy or scipy function at hand... — Radio Controlled, Feb 16 '19 at 09:06
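For reference, the sort-based route described in the comment above can be sketched without ever building the dense matrix. This is only a sketch under assumptions: `indicator_sorted` is a hypothetical helper name, not a numpy or scipy function, and instead of detecting change points by hand it sorts b once and uses np.searchsorted to find, for each a[i], the run of equal values in the sorted b.

```python
import numpy as np
from scipy import sparse

def indicator_sorted(a, b):
    """Hypothetical helper: sparse I with I[i, j] == True where a[i] == b[j]."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Sort b once so that equal values form contiguous runs.
    order = np.argsort(b, kind="stable")
    b_sorted = b[order]
    # For each a[i], the matching run in b_sorted is [lo[i], hi[i]).
    lo = np.searchsorted(b_sorted, a, side="left")
    hi = np.searchsorted(b_sorted, a, side="right")
    counts = hi - lo  # number of matches per row
    # Each row index appears once per match.
    rows = np.repeat(np.arange(len(a)), counts)
    # Offset inside each run: 0, 1, ..., counts[i] - 1 for every row i.
    starts = np.cumsum(counts) - counts
    within = np.arange(counts.sum()) - np.repeat(starts, counts)
    # Map positions in the sorted array back to original indices of b.
    cols = order[np.repeat(lo, counts) + within]
    data = np.ones(len(rows), dtype=bool)
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=400)
b = rng.integers(0, 10, size=300)
I4 = indicator_sorted(a, b)
# Compare against the dense construction (only feasible for small inputs).
print((I4 != sparse.csr_matrix(a[:, None] == b)).nnz == 0)
```

The temporary arrays here scale with the number of matches plus len(a) and len(b), not with len(a) * len(b), so this should only pay off when the equality matrix is actually sparse.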

Thomas Kühn · Accepted Answer · 2019-02-17T13:16:55.643

One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
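    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    ##every index in x pairs with every index in y, so x is repeated
    ##len(y) times and y is tiled len(x) times
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))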

##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)

##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )

##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)

The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
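To see how np.repeat and np.tile produce every (row, column) pair for a single shared value, here is a small illustration. The index arrays x and y are made up for the example: each of the len(x) row indices has to be paired with each of the len(y) column indices, so x is repeated len(y) times and y is tiled len(x) times.

```python
import numpy as np

# Made-up positions of one shared value: 2 hits in a, 3 hits in b.
x = np.array([0, 3])
y = np.array([1, 2, 5])

# Each row index is repeated once per matching column ...
rows = np.repeat(x, len(y))   # [0 0 0 3 3 3]
# ... and the column indices are cycled once per matching row.
cols = np.tile(y, len(x))     # [1 2 5 1 2 5]

# Both arrays have len(x) * len(y) entries and line up pairwise.
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
```

Both index arrays must end up with the same length, len(x) * len(y), otherwise the csr_matrix constructor cannot pair them.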

Old Answer:

I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers to first generate the sparse matrix the way the OP posted, and then uses a generator expression to test all element pairs for equality:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0)  ## --> True

I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...

I tried 15000 by 15000 with 10 different values, but of course it depends mostly on the sparsity of the equality matrix, which depends on the distribution of different values. — Radio Controlled, Feb 16 '19 at 10:53
@RadioControlled Sure, but my suggestion is slow because you have to iterate over all i and all j of the 1d arrays. Maybe with the sorting you suggested one could indeed speed up things, but I don't think that there is anything inbuilt... — Thomas Kühn, Feb 16 '19 at 10:55
@RadioControlled I had another idea -- please have a look at the edit in my answer. — Thomas Kühn, Feb 16 '19 at 11:16
That seems to be a good solution! I think the first suggestion can be removed. Thanks! — Radio Controlled, Feb 16 '19 at 11:47

iacob · Answer 2 · 2019-02-16T11:05:55.230

You could use numpy.isclose with small tolerance:

np.isclose(a,b)

Or pandas.DataFrame.eq:

a.eq(b)

Note that this returns an array of True/False values.

edited Feb 16 '19 at 11:05

answered Feb 16 '19 at 09:24


Seems to me that `np.isclose(a[:,None],b)` returns a dense array, too. Same for pandas, plus more dependencies required... – Radio Controlled Feb 16 '19 at 10:36
