I'm trying to implement a faster version of np.isin in numba. This is what I have so far:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def isin(a, b):
    out = np.empty(a.shape[0], dtype=nb.boolean)
    b = set(b)
    for i in nb.prange(a.shape[0]):
        if a[i] in b:
            out[i] = True
        else:
            out[i] = False
    return out
For numbers it works, as seen in this example:
a = np.array([1,2,3,4])
b = np.array([2,4])
isin(a,b)
>>> array([False, True, False, True])
And it's faster than np.isin:
a = np.random.rand(20000)
b = np.random.rand(5000)
%time isin(a,b)
CPU times: user 3.96 ms, sys: 0 ns, total: 3.96 ms
Wall time: 1.05 ms
%time np.isin(a,b)
CPU times: user 11 ms, sys: 0 ns, total: 11 ms
Wall time: 8.48 ms
However, I would like to use this function with arrays containing strings. The problem is that whenever I pass an array of strings, numba complains that it has no implementation of set() for this kind of data:
a = np.array(['A','B','C','D'])
b = np.array(['B','D'])
isin(a,b)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'set'>) found for signature:

 >>> set(array([unichr x 1], 1d, C))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
    Overload of function 'set': File: numba/core/typing/setdecl.py: Line 20.
      With argument(s): '(array([unichr x 1], 1d, C))':
      No match.

During: resolving callee type: Function(<class 'set'>)
During: typing of call at /tmp/ipykernel_20582/4221597969.py (7)

File "../../../../tmp/ipykernel_20582/4221597969.py", line 7:
<source missing, REPL/exec in use?>
Is there a way, like specifying a signature, that will allow me to use this directly on arrays of strings?
I know I could assign a numerical value to each string, but for large arrays I think this would take a while and make the whole thing slower than just using np.isin.
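For reference, the integer-encoding workaround I have in mind would look roughly like the sketch below. The helper name encode_pair is just something I made up for illustration, and I use np.isin on the codes here only so the snippet runs without numba; the idea would be to pass the integer codes to the numba isin above instead. Note that np.unique has to sort the combined array, which is exactly the extra cost I'm worried about.

```python
import numpy as np

def encode_pair(a, b):
    """Map the strings in a and b to integer codes drawn from one shared vocabulary.

    Hypothetical helper for illustration only.
    """
    # np.unique sorts the combined values; return_inverse gives, for every
    # element of the concatenated array, its index in that sorted vocabulary.
    _, codes = np.unique(np.concatenate((a, b)), return_inverse=True)
    codes = codes.ravel()  # keep the codes 1-D regardless of NumPy version
    n = a.shape[0]
    return codes[:n], codes[n:]

a = np.array(['A', 'B', 'C', 'D'])
b = np.array(['B', 'D'])

a_codes, b_codes = encode_pair(a, b)

# The integer arrays can be handled by any numeric membership test;
# np.isin stands in here for the numba version.
mask = np.isin(a_codes, b_codes)
```

Equal strings always get equal codes because both arrays are encoded against the same vocabulary, so the membership result on the codes matches the result on the original strings.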
Any ideas?