import numpy
data = numpy.random.randint(0, 10, (6,8))
test = set(numpy.random.randint(0, 10, 5))

I want an expression whose value is a Boolean array with the same shape as data (or one that can at least be reshaped to that shape), telling me whether the corresponding element of data is in test.

E.g., if I want to know which elements of data are strictly less than 6, I can use a single vectorized expression,

a = data < 6

that computes a 6x8 boolean ndarray. By contrast, when I try an apparently equivalent boolean expression

b = data in test

what I get is an exception:

TypeError: unhashable type: 'numpy.ndarray'

Addendum: benchmarking different solutions

Edit: the possibility #4 below gives wrong results, thanks to hpaulj and Divakar for getting me on the right track.

Here I compare four different possibilities,

  1. What was proposed by Divakar, np.in1d(data, np.hstack(test)).
  2. One proposal by hpaulj, np.in1d(data, np.array(list(test))).
  3. Another proposal by hpaulj, np.in1d(data, np.fromiter(test, int)).
  4. What was proposed in an answer removed by its author, whose name I don't remember, np.in1d(data, test).

Here is the IPython session, slightly edited to avoid blank lines:

In [1]: import numpy as np
In [2]: nr, nc = 100, 100
In [3]: top = 3000
In [4]: data = np.random.randint(0, top, (nr, nc))
In [5]: test = set(np.random.randint(0, top, top//3))
In [6]: %timeit np.in1d(data, np.hstack(test))
100 loops, best of 3: 5.65 ms per loop
In [7]: %timeit np.in1d(data, np.array(list(test)))
1000 loops, best of 3: 1.4 ms per loop
In [8]: %timeit np.in1d(data, np.fromiter(test, int))
1000 loops, best of 3: 1.33 ms per loop

In [9]: %timeit np.in1d(data, test)
1000 loops, best of 3: 687 µs per loop

In [10]: nr, nc = 1000, 1000
In [11]: top = 300000
In [12]: data = np.random.randint(0, top, (nr, nc))
In [13]: test = set(np.random.randint(0, top, top//3))
In [14]: %timeit np.in1d(data, np.hstack(test))
1 loop, best of 3: 706 ms per loop
In [15]: %timeit np.in1d(data, np.array(list(test)))
1 loop, best of 3: 269 ms per loop
In [16]: %timeit np.in1d(data, np.fromiter(test, int))
1 loop, best of 3: 274 ms per loop

In [17]: %timeit np.in1d(data, test)
10 loops, best of 3: 67.9 ms per loop

In [18]: 

The best times are given by the (now) anonymous poster's answer.

It turns out that the anonymous poster had a good reason to remove their answer, the results being wrong!

As hpaulj commented, the documentation of in1d warns against using a set as the second argument, but I would prefer an explicit failure whenever the computed results could be wrong.

That said, the solution using numpy.fromiter() has the best numbers...
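A quick way to check candidates like the ones timed above is to compare each against a slow but obviously correct element-by-element reference. The sketch below is only an illustration: it assumes NumPy 1.13 or later, where np.isin (a shape-preserving relative of np.in1d) is available, and the helper name reference_mask, the seed, and the array sizes are made up for this example.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the check is repeatable
data = rng.randint(0, 30, (6, 8))
test = set(int(v) for v in rng.randint(0, 30, 10))


def reference_mask(data, test):
    """Slow, obviously correct membership test (hypothetical helper)."""
    flat = [int(x) in test for x in data.ravel()]
    return np.array(flat, dtype=bool).reshape(data.shape)


expected = reference_mask(data, test)

# Two ways of turning the set into a real 1-D array before the lookup.
from_list = np.isin(data, np.array(list(test)))
from_iter = np.isin(data, np.fromiter(test, dtype=int))

print(np.array_equal(from_list, expected))
print(np.array_equal(from_iter, expected))
```

Running the same comparison on any new candidate before timing it would catch a fast but wrong variant early.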

gboffi
  • The output would be of the same shape as `data` or as `set`? – Divakar Jun 12 '16 at 13:53
  • What would the expected output be? `data < 6` produces a new array for example. – Martijn Pieters Jun 12 '16 at 13:56
  • @Divakar Not necessarily as long as the size is the same and I can use `numpy.reshape` – gboffi Jun 12 '16 at 13:56
  • @gboffi: that's not all that clear. `data in test` would produce either `True` or `False` since `set` objects are not `numpy` objects. – Martijn Pieters Jun 12 '16 at 13:58
  • @MartijnPieters Tx for the edit, I put `data < 6` exactly as an example of the type of expression and the type of output that I want, and the output would be, for the particular problem, a `6x8` boolean ndarray. – gboffi Jun 12 '16 at 13:59
  • @MartijnPieters `data in test` produces a `TypeError`. I'm going to edit my question to address Divakar's comments as well as yours. – gboffi Jun 12 '16 at 14:02
  • @HåkenLid Thank you for your edit, but I feel I have to revert it, because I already knew why `data in set` doesn't work before asking my question, which imho has nothing to do with the title you've put on it. – gboffi Jun 12 '16 at 21:19
  • No problem. You don't need my permission to change the title back, just click edit. If possible, it's preferable that the title is in the form of a question, though. – Håken Lid Jun 12 '16 at 21:24
  • `np.in1d(data, test)` is fastest because it doesn't check against all the members of the set, and thus gives a completely different result than the others. – Håken Lid Jun 13 '16 at 09:18
  • Since you haven't included my suggestion in your benchmarks, I've added a benchmark to my answer. – Håken Lid Jun 13 '16 at 09:20

2 Answers


I am assuming you are looking for a boolean array that detects the presence of the set elements in the data array. To do so, you can extract the elements from the set with np.hstack and then use np.in1d to detect the presence of any element from the set at each position in data, giving us a boolean array of the same size as data. Since np.in1d flattens the input before processing, as a final step we need to reshape the output from np.in1d back to the original 2D shape. Thus, the final implementation would be -

np.in1d(data,np.hstack(test)).reshape(data.shape)

Sample run -

In [125]: data
Out[125]: 
array([[7, 0, 1, 8, 9, 5, 9, 1],
       [9, 7, 1, 4, 4, 2, 4, 4],
       [0, 4, 9, 6, 6, 3, 5, 9],
       [2, 2, 7, 7, 6, 7, 7, 2],
       [3, 4, 8, 4, 2, 1, 9, 8],
       [9, 0, 8, 1, 6, 1, 3, 5]])

In [126]: test
Out[126]: {3, 4, 6, 7, 9}

In [127]: np.in1d(data,np.hstack(test)).reshape(data.shape)
Out[127]: 
array([[ True, False, False, False,  True, False,  True, False],
       [ True,  True, False,  True,  True, False,  True,  True],
       [False,  True,  True,  True,  True,  True, False,  True],
       [False, False,  True,  True,  True,  True,  True, False],
       [ True,  True, False,  True, False, False,  True, False],
       [ True, False, False, False,  True, False,  True, False]], dtype=bool)
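For readers on NumPy 1.13 or later, np.isin does the same lookup while keeping the shape of its first argument, so the reshape step is not needed. The following is a small sketch under that assumption, reusing the first two rows of the sample data above; note that it converts the set with list() rather than passing the set itself.

```python
import numpy as np

# First two rows of the sample `data` shown above, and the same `test` set.
data = np.array([[7, 0, 1, 8, 9, 5, 9, 1],
                 [9, 7, 1, 4, 4, 2, 4, 4]])
test = {3, 4, 6, 7, 9}

# np.isin returns a boolean array with the same shape as `data`.
mask = np.isin(data, list(test))

# The mask can be used directly for boolean indexing.
members = data[mask]

print(mask)
print(members)
```

The two rows of mask should match the first two rows of the output in the sample run.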
Divakar
  • You have _perfectly_ understood what I was asking, thank you. I didn't know of `in1d`: `numpy`'s namespace is a real treasure chest... For the moment, your answer is correct and there is a little extra gem hidden in `np.hstack(test)`. I'll likely accept it as soon as possible; for now it's a +1 – gboffi Jun 12 '16 at 14:18
  • @gboffi Bit of guess work, but glad to finally get it right! :) Yeah I just discovered while answering this, that one can convert set to NumPy array with `np.hstack` (at least for such a set). I don't deal with sets a lot, but seems like useful stuff that one. – Divakar Jun 12 '16 at 14:20
  • `hstack` is using `[np.atleast_1d(i) for i in test]` to convert the set to a list of arrays. `np.array(list(test))` or `np.fromiter(test,int)` also work. – hpaulj Jun 12 '16 at 17:29
  • @hpaulj In fact it is possible to call `np.in1d(data, test)` directly, as was shown in a good answer that its poster has removed. In this case (using a `set` directly, with no conversion) `numpy` has to do something behind the scenes... I'll try to benchmark and show the results. – gboffi Jun 12 '16 at 21:59
  • `in1d` warns against using a `set` directly. In your time test cases, check whether the results are the same. I get all `False` in the direct `test` case. – hpaulj Jun 12 '16 at 22:19
  • Study the `np.in1d` code. Run its code step by step on a small test case. It will be instructive. It uses a rather clever idea, involving concatenation and sorting. – hpaulj Jun 12 '16 at 22:45
  • @gboffi `np.in1d` expects NumPy arrays as inputs. `test` is a set, so we have to convert that to a regular NumPy array. `np.in1d(data, test)` won't give you the correct results. If you manually inspect the results for a `6x8` array, you would see. Or use `np.allclose()` between outputs for that same verification. – Divakar Jun 13 '16 at 04:06

The expression a = data < 6 returns a new array because < is a value comparison operator.

Arithmetic, matrix multiplication, and comparison operations

Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally yield ndarray objects as results.

Each of the arithmetic operations (+, -, *, /, //, %, divmod(), ** or pow(), <<, >>, &, ^, |, ~) and the comparisons (==, <, >, <=, >=, !=) is equivalent to the corresponding universal function (or ufunc for short) in Numpy.

Note that the in operator is not in this list, probably because it works in the opposite direction to most operators.

While a + b is the same as a.__add__(b), a in b works right to left, as b.__contains__(a). In this case Python tries to call set.__contains__(), which only accepts hashable types. Arrays are mutable and therefore unhashable, so they can't be members of a set.
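The dispatch described above can be seen on a tiny example. This is only an illustrative sketch; the array contents and the variable names raised and scalar_hits are chosen for the example.

```python
import numpy as np

data = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
test = {1, 4}

# `data in test` is evaluated as test.__contains__(data). The set has to
# hash its argument, and ndarrays are unhashable, so a TypeError is raised.
try:
    data in test
    raised = False
except TypeError:
    raised = True

# Individual elements are hashable, so a per-element test works fine.
scalar_hits = [int(x) in test for x in data.ravel()]

print(raised, scalar_hits)
```

The per-element loop gives the right answer but runs in pure Python, which is what the vectorized alternatives try to avoid or hide.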

One solution is to use numpy.vectorize instead of using in directly; it calls an arbitrary Python function on each element of the array.

It's a kind of map() for numpy arrays.

numpy.vectorize

Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a numpy array as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

>>> import numpy
>>> data = numpy.random.randint(0, 10, (3, 3))
>>> test = set(numpy.random.randint(0, 10, 5))
>>> numpy.vectorize(test.__contains__)(data)

array([[False, False,  True],
       [ True,  True, False],
       [ True, False,  True]], dtype=bool)
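A small variation, offered as a sketch rather than as part of the original answer: passing otypes=[bool] to numpy.vectorize fixes the output dtype up front, which should also let the call work on an empty array. The seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(1)      # fixed seed, for a repeatable example
data = rng.randint(0, 10, (3, 3))
test = set(int(v) for v in rng.randint(0, 10, 5))

# Declaring the output type avoids the extra probing call that vectorize
# otherwise makes on the first element to guess the dtype.
contains = np.vectorize(test.__contains__, otypes=[bool])

mask = contains(data)

# With otypes set, an empty input is accepted as well.
empty_mask = contains(np.empty((0, 3), dtype=int))

print(mask)
print(empty_mask.shape)
```

Note that numpy.vectorize is still a Python-level loop under the hood; it is a convenience, and any speed advantage comes from the constant-time set lookup.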

Benchmarks

This approach is fast when n is large, since set.__contains__() is a constant-time operation ("large" here means that top > 13000 or so).

>>> import numpy as np
>>> nr, nc = 100, 100
>>> top = 300000
>>> data = np.random.randint(0, top, (nr, nc))
>>> test = set(np.random.randint(0, top, top//3))
>>> %timeit -n10 np.in1d(data, list(test)).reshape(data.shape)
10 loops, best of 3: 26.2 ms per loop

>>> %timeit -n10 np.in1d(data, np.hstack(test)).reshape(data.shape)
10 loops, best of 3: 374 ms per loop

>>> %timeit -n10 np.vectorize(test.__contains__)(data)
10 loops, best of 3: 3.16 ms per loop

However, when n is small, the other solutions are significantly faster.
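Outside IPython, a comparison of this kind can be reproduced with the standard timeit module. The sketch below is a hypothetical harness, not the one used for the numbers above: it assumes NumPy 1.13 or later for np.isin, uses arbitrary sizes and a fixed seed, and checks that all candidates agree before timing them. Timings vary by machine and NumPy version, so none are shown.

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
top = 30000                          # arbitrary size for the illustration
data = rng.randint(0, top, (100, 100))
test = set(int(v) for v in rng.randint(0, top, top // 3))

candidates = {
    "isin + list": lambda: np.isin(data, list(test)),
    "isin + fromiter": lambda: np.isin(data, np.fromiter(test, dtype=int)),
    "vectorize": lambda: np.vectorize(test.__contains__, otypes=[bool])(data),
}

# Check correctness first: a fast but wrong candidate is of no use.
results = {name: fn() for name, fn in candidates.items()}
baseline = results["isin + list"]
agree = all(np.array_equal(baseline, r) for r in results.values())

# Best of 3 repeats, 3 calls each, reported per call.
timings = {name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
           for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name:16s} {seconds * 1e3:8.3f} ms per call (results agree: {agree})")
```

Varying top and the shape of data in this harness is a simple way to find where the crossover between the approaches lies on a given machine.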

Håken Lid
  • is it possible to vectorize (i.e. speed up) the 2D DFT operation expressed in the link https://stackoverflow.com/questions/70768384/right-method-for-finding-2-d-spatial-spectrum-from-cross-spectral-densities? – pluto Feb 07 '22 at 11:42