
I'm implementing a search algorithm with NumPy in which one step checks whether a vector is in a matrix (as a row). I used to use np.isin for this, but I became curious whether the Python keyword in would work. I tested it and found that it does work.

Since I didn't find any Python interface for in (like __add__ for + or __abs__ for abs), I assumed in was hard-wired in Python to use the standard iterator logic and should therefore be slower than the NumPy-provided np.isin. But after some testing, unbelievably:

>>> import numpy as np
>>> a = np.int8(1)
>>> A = np.zeros(2**24, 'b')
>>> %timeit a in A
21.7 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np.isin(a, A)
310 ms ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which says np.isin is more than 10 times slower than Python's in for a small data type. I also ran a test with a large data type:

>>> a = np.ones(1, 'V256')
>>> A = np.zeros(2**22, 'V256')
>>> %timeit a in A
129 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.isin(a, A)
10.5 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which says np.isin is roughly 80 times slower.

I'm wondering what the reason for this could be. Note that since a=1 while A=[0,0,...], the comparison has to run over the whole array; there is no "early exit" on the Python side.

EDIT: Oh, there actually is a Python interface for in, called __contains__. But still, why would np.isin be so much slower than np.ndarray.__contains__?

ZisIsNotZis
  • "I therefore tested it and find it do works." - [nope](https://stackoverflow.com/questions/18320624/how-does-contains-work-for-ndarrays). – user2357112 Nov 22 '18 at 04:38
  • "Since I didn't find any python interface for `in`" - [you were probably looking in the wrong places](https://docs.python.org/3/reference/datamodel.html#object.__contains__). – user2357112 Nov 22 '18 at 04:39
  • @user2357112 I know it only works on scalars. You can make it work by viewing each row as a scalar, so it becomes identical to testing `in` on a `VsomeNumber` data type. – ZisIsNotZis Nov 22 '18 at 04:41
  • `isin` is Python code which you can read. Or read the whole `arraysetops.py`. I think this code is written more for convenience than for performance. Arrays aren't optimal for search tasks. – hpaulj Nov 22 '18 at 06:49

2 Answers


numpy.ndarray.__contains__ is basically just (elem == arr).any() (even when that doesn't make sense). You can take a look at the source, which is very short and simple for a NumPy C routine.
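
As a rough illustration of the `(elem == arr).any()` behaviour described above, the sketch below compares `in` with a hand-written equivalent. The helpers `contains_like_ndarray` and `has_row` are hypothetical names introduced here for illustration, not NumPy API, and the row test is just one possible way to check row membership.

```python
import numpy as np

def contains_like_ndarray(arr, elem):
    # Hypothetical helper: mirrors the behaviour the answer attributes to `elem in arr`.
    return bool((np.asarray(elem) == arr).any())

def has_row(matrix, row):
    # Hypothetical helper: a row counts as present only if every column
    # matches within the same row.
    return bool((matrix == np.asarray(row)).all(axis=1).any())

M = np.array([[1, 2],
              [3, 5]])

# Scalar lookup: both spellings agree.
scalar_hit = 5 in M
scalar_hit_manual = contains_like_ndarray(M, 5)

# The case "that doesn't make sense": [1, 5] is not a row of M, but the
# element-wise comparison broadcasts and finds matches in two different rows.
misleading = [1, 5] in M

# The stricter row test rejects [1, 5] and accepts [3, 5].
row_absent = has_row(M, [1, 5])
row_present = has_row(M, [3, 5])
```

This is why a plain `in` on a 2-D array is not a reliable row-membership test unless each row is first viewed as a single scalar, as discussed in the comments on the question.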

numpy.isin broadcasts over its left operand, and it's optimized for efficiency in the broadcasting case. For a small left operand, it will use an approach based on sorting, which is overkill for a scalar. It currently has no fast path for the left operand being a scalar, or for the left operand being an array small enough that sorting is more expensive than a naive approach.
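
To make the broadcasting point concrete, here is a small sketch. It only shows that `np.isin` answers one membership question per element of its left operand, while a naive comparison answers a single scalar query; it does not reproduce the timings. `scalar_in` is a hypothetical helper, and the array size and values are arbitrary assumptions.

```python
import numpy as np

A = np.zeros(2**10, dtype=np.int8)
A[7] = 3  # plant a single non-zero value so one query can succeed

# np.isin returns one boolean per element of the left operand.
queries = np.array([0, 1, 3], dtype=np.int8)
per_element = np.isin(queries, A)

def scalar_in(value, arr):
    # Hypothetical helper, not NumPy API: one linear scan, no sorting.
    return bool((arr == value).any())

# Asking the same questions one scalar at a time gives the same answers.
one_by_one = [scalar_in(q, A) for q in queries]
```

For a single scalar the linear scan is all the work that is needed, which is consistent with the explanation above of why `in` comes out ahead in the question's benchmark.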

user2357112

This doesn't answer the question as asked, but it may give you some ideas. Generally, the big idea behind getting good performance from NumPy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from Python code (slow) into C/Fortran loops somewhere in the NumPy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization), performance will usually be quite good.
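
A minimal sketch of what moving the loop out of Python can look like, using a sum of squares as an arbitrary example. The function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def sum_of_squares_loop(values):
    # The interpreter executes one iteration per element.
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_vectorized(values):
    # A single call; the per-element loop runs inside NumPy's compiled code.
    return float(np.dot(values, values))

x = np.arange(1000, dtype=np.float64)
loop_result = sum_of_squares_loop(x)
vectorized_result = sum_of_squares_vectorized(x)
```

Both functions compute the same value; the difference lies only in where the per-element loop runs.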

Of course, you can get even better performance by dropping the Python interpreter and using, say, C++ instead. Whether this approach actually succeeds depends on how good you are at high-performance programming in C++ versus NumPy, and on exactly what operation you're trying to do.

user10468005
  • "high performance programming" is a concept I have never heard before. – b-fg Nov 22 '18 at 05:18
  • The highest performance is machine code or assembly language (different representations of the same thing). You're not limited by any constraints other than the hardware capabilities of the device. (IOW, there's no "make toast" instruction.) After that, which language has the best performance is like asking which hand tool has the best performance. It's difficult to cut wood with a hammer or drive a nail with a saw. Some languages have better performance doing some things, some have better performance doing other things. – user10468005 Nov 22 '18 at 06:14