2

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:

I = a[:,None]==b

The following is slow and still memory-inefficient during creation:

I = csr((a[:,None]==b),shape=(len(a),len(b)))

The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:

z = np.argwhere((a[:,None]==b))

Any ideas?

hpaulj
  • 221,503
  • 14
  • 230
  • 353
Radio Controlled
  • 825
  • 8
  • 23
  • I guess I could do the whole tour with argsorting a, detect the indeces where sorted array changes, calculate the combinations of each size of partition determined this way, unsort again... But I hope there is a simple numpy or scipy function at hand... – Radio Controlled Feb 16 '19 at 09:06

2 Answers2

2

One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()    
    rows.append(np.repeat(x, len(x)))
    cols.append(np.tile(y, len(y)))

##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)

##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )

##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)

The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.

Old Answer:

I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here some code that uses two 1d arras of random integers to first generate the sparse matrix the way that the OP posted and then uses a generator expression to test all elements for equality:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0)  ## --> True

I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...

Thomas Kühn
  • 9,412
  • 3
  • 47
  • 63
0

You could use numpy.isclose with small tolerance:

np.isclose(a,b)

Or pandas.DataFrame.eq:

a.eq(b)

Note this returns an array of True False.

iacob
  • 20,084
  • 6
  • 92
  • 119