import numpy
data = numpy.random.randint(0, 10, (6,8))
test = set(numpy.random.randint(0, 10, 5))
I want an expression whose value is a Boolean array with the same shape as data (or, at least, one that can be reshaped to the same shape) that tells me whether the corresponding element of data is in the set test.
E.g., if I want to know which elements of data are strictly less than 6, I can use a single vectorized expression,
a = data < 6
that computes a 6x8 boolean ndarray. By contrast, when I try an apparently equivalent boolean expression
b = data in test
what I get is an exception:
TypeError: unhashable type: 'numpy.ndarray'
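The contrast described above can be reproduced with a short, self-contained sketch. The try/except wrapper and the name error_message are only illustrative additions; the rest mirrors the snippet at the top of the question. The comparison broadcasts over the array, while the in operator asks the set to hash the whole array at once.

```python
import numpy as np

data = np.random.randint(0, 10, (6, 8))
test = set(np.random.randint(0, 10, 5))

# Elementwise comparison: NumPy applies `<` to every element of the array.
a = data < 6

# `in` calls set.__contains__, which has to hash its argument,
# here the entire ndarray, and ndarrays are not hashable.
try:
    b = data in test
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(a.shape, a.dtype)
print(error_message)
```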
Addendum: benchmarking different solutions
Edit: possibility #4 below gives wrong results; thanks to hpaulj and Divakar for getting me on the right track.
Here I compare four different possibilities:
- What was proposed by Divakar, np.in1d(data, np.hstack(test)).
- One proposal by hpaulj, np.in1d(data, np.array(list(test))).
- Another proposal by hpaulj, np.in1d(data, np.fromiter(test, int)).
- What was proposed in an answer removed by its author, whose name I don't remember, np.in1d(data, test).
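As a minimal sketch of the convert-the-set-first idea behind the proposals above, the following checks the vectorized result against a plain Python loop. Note one substitution: it uses np.isin (available in NumPy 1.13 and later) instead of np.in1d, because np.isin keeps the shape of data whereas np.in1d returns a flat array that would need a reshape. The names test_array, mask and expected are illustrative only.

```python
import numpy as np

np.random.seed(0)
data = np.random.randint(0, 10, (6, 8))
test = set(np.random.randint(0, 10, 5))

# Turn the set into a 1-D integer array before handing it to NumPy.
test_array = np.fromiter(test, dtype=int, count=len(test))

# Vectorized membership test; the result has the same shape as `data`.
mask = np.isin(data, test_array)

# Reference result computed element by element in pure Python.
expected = np.array([[x in test for x in row] for row in data])

print(mask.shape, mask.dtype)
print(np.array_equal(mask, expected))
```

With np.in1d the equivalent would be to reshape the flat result back to data.shape.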
Here is the IPython session, slightly edited to avoid blank lines:
In [1]: import numpy as np
In [2]: nr, nc = 100, 100
In [3]: top = 3000
In [4]: data = np.random.randint(0, top, (nr, nc))
In [5]: test = set(np.random.randint(0, top, top//3))
In [6]: %timeit np.in1d(data, np.hstack(test))
100 loops, best of 3: 5.65 ms per loop
In [7]: %timeit np.in1d(data, np.array(list(test)))
1000 loops, best of 3: 1.4 ms per loop
In [8]: %timeit np.in1d(data, np.fromiter(test, int))
1000 loops, best of 3: 1.33 ms per loop
In [9]: %timeit np.in1d(data, test)
1000 loops, best of 3: 687 µs per loop
In [10]: nr, nc = 1000, 1000
In [11]: top = 300000
In [12]: data = np.random.randint(0, top, (nr, nc))
In [13]: test = set(np.random.randint(0, top, top//3))
In [14]: %timeit np.in1d(data, np.hstack(test))
1 loop, best of 3: 706 ms per loop
In [15]: %timeit np.in1d(data, np.array(list(test)))
1 loop, best of 3: 269 ms per loop
In [16]: %timeit np.in1d(data, np.fromiter(test, int))
1 loop, best of 3: 274 ms per loop
In [17]: %timeit np.in1d(data, test)
10 loops, best of 3: 67.9 ms per loop
In [18]:
The best times are given by the (now) anonymous poster's answer.
It turns out that the anonymous poster had a good reason to remove their answer: the results are wrong!
As hpaulj commented, the documentation of in1d warns against using a set as the second argument, but I would prefer an explicit failure whenever the computed results could be wrong.
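That said, among the solutions that give correct results, the one using numpy.fromiter() has the best numbers in the smaller benchmark and is within a few milliseconds of np.array(list(test)) in the larger one...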
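The warning about passing a set can be illustrated without running the membership test at all. In this small sketch (the names wrapped and unpacked are illustrative), np.asarray does not iterate over a set; it wraps the set as a single object in a zero-dimensional array, so a function that converts its second argument this way never sees the individual elements. np.fromiter, by contrast, iterates over the set and yields a proper 1-D array.

```python
import numpy as np

test = {1, 2, 3}

# A set is not a sequence, so NumPy stores it as one opaque Python object.
wrapped = np.asarray(test)

# Iterating explicitly gives a 1-D integer array of the set's elements.
unpacked = np.fromiter(test, dtype=int)

print(wrapped.shape, wrapped.dtype)
print(unpacked.shape, sorted(unpacked.tolist()))
```

This is presumably also why the set-based call looks so fast in the timings: there is only one object to compare against.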