I'm implementing a search algorithm with NumPy where one step is to check whether a vector occurs in a matrix (as a row). I used to use `np.isin` for this, but I suddenly became curious whether the Python keyword `in` would also work. I tested it, and it does. Since I didn't find a Python interface for `in` (like `__add__` for `+` or `__abs__` for `abs`), I believed `in` was hard-wired into Python using the standard iterator logic, so it should be slower than the NumPy-provided `np.isin`. But after some testing, unbelievably:
>>> a = np.int8(1)
>>> A = np.zeros(2**24, 'b')
>>> %timeit a in A
>>> %timeit np.isin(a, A)
21.7 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
310 ms ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
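My understanding (an assumption, not something I verified in the NumPy source) is that for a scalar needle, `a in A` boils down to a single vectorized equality pass plus a reduction, roughly:

```python
import numpy as np

a = np.int8(1)
A = np.zeros(2**16, dtype='b')  # smaller array so this runs fast

# Sketch of what `a in A` presumably does for a scalar:
# one fused elementwise comparison, then a boolean reduction.
contains = bool((A == a).any())

# It agrees with the `in` operator here:
assert contains == (a in A)
```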
which says `np.isin` is 10+ times slower than the Python `in` for a small data type. I also tested a big data type:
>>> a = np.ones(1, 'V256')
>>> A = np.zeros(2**22, 'V256')
>>> %timeit a in A
>>> %timeit np.isin(a, A)
129 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.5 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
which says `np.isin` is ~100 times slower.

I'm wondering what the reason for this could be. Note that since `a = 1` while `A = [0, 0, ...]`, the match has to be done against the whole array; there is no "early exit" the Python side could exploit.
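To make the "no early exit" point concrete: even if I plant the needle at index 0, the vectorized comparison (which I assume is what `in` uses under the hood) still touches every element before reducing, so a hit at the front costs the same as no hit at all:

```python
import numpy as np

A = np.zeros(2**16, dtype='b')
A[0] = 1  # put the needle at the very front

# The elementwise comparison materializes a full-length boolean
# array; there is no short-circuiting at the first match.
mask = (A == np.int8(1))
assert mask.shape == A.shape
assert bool(mask.any()) and (np.int8(1) in A)
```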
EDIT Oh, actually there is a Python interface for `in`, called `__contains__`. But still, why would `np.isin` be so much slower than `np.ndarray.__contains__`?
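For completeness, here is how I confirmed that `in` dispatches to `np.ndarray.__contains__` and that both spellings agree:

```python
import numpy as np

A = np.zeros(8, dtype='b')

# `in` is resolved through the type's __contains__ slot,
# which np.ndarray does define:
assert hasattr(np.ndarray, '__contains__')

# Both spellings answer the same membership question:
assert (np.int8(0) in A) == bool(A.__contains__(np.int8(0))) == True
assert (np.int8(1) in A) == bool(A.__contains__(np.int8(1))) == False
```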