
I'm trying to implement a faster version of np.isin in Numba. This is what I have so far:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def isin(a, b):
    out = np.empty(a.shape[0], dtype=nb.boolean)
    b = set(b)  # convert to a set for O(1) membership tests
    for i in nb.prange(a.shape[0]):
        if a[i] in b:
            out[i] = True
        else:
            out[i] = False
    return out

For numbers it works, as seen in this example:

a = np.array([1,2,3,4])
b = np.array([2,4])

isin(a,b)
>>> array([False,  True, False,  True])

And it's faster than np.isin:

a = np.random.rand(20000)
b = np.random.rand(5000)

%time isin(a,b)
CPU times: user 3.96 ms, sys: 0 ns, total: 3.96 ms
Wall time: 1.05 ms

%time np.isin(a,b)
CPU times: user 11 ms, sys: 0 ns, total: 11 ms
Wall time: 8.48 ms

However, I would like to use this function with arrays containing strings. The problem is that whenever I pass an array of strings, Numba complains that it cannot type the set() call for this data.

a = np.array(['A','B','C','D'])
b = np.array(['B','D'])

isin(a,b)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'set'>) found for signature:
 
 >>> set(array([unichr x 1], 1d, C))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload of function 'set': File: numba/core/typing/setdecl.py: Line 20.
    With argument(s): '(array([unichr x 1], 1d, C))':
   No match.

During: resolving callee type: Function(<class 'set'>)
During: typing of call at /tmp/ipykernel_20582/4221597969.py (7)


File "../../../../tmp/ipykernel_20582/4221597969.py", line 7:
<source missing, REPL/exec in use?>

Is there a way, like specifying a signature, that will allow me to use this directly on arrays of strings?

I know I could assign a numerical value to each string, but for large arrays I think this will take a while and will make the whole thing slower than just using np.isin.
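For reference, here is a rough sketch of that numeric mapping in plain Numpy (not the Numba path itself), using np.unique with return_inverse=True; it is a single vectorized sort over both arrays, so it may be cheaper than a Python-level loop:

```python
import numpy as np

a = np.array(['A', 'B', 'C', 'D'])
b = np.array(['B', 'D'])

# One sorted pass over all strings; unique-ing the concatenation
# guarantees that a and b share the same integer codes.
all_vals, inv = np.unique(np.concatenate([a, b]), return_inverse=True)
a_codes = inv[:len(a)]
b_codes = inv[len(a):]

# The integer codes can now be fed to the numeric-only fast path.
print(np.isin(a_codes, b_codes))  # [False  True False  True]
```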

Any ideas?

  • Are you sure this is faster than the numpy version? – Z Li Jan 26 '22 at 16:05
  • It is indeed, I added an example just to show the difference in time. – Pau Jan 26 '22 at 16:25
  • weird on my platform the numpy version is faster – Z Li Jan 26 '22 at 16:27
  • The first run of a numba function is always a bit slower, since it has to compile the code. Then it becomes much faster. This could be it. – Pau Jan 26 '22 at 16:30
  • yes I am aware of that. I test it after the first compile. To solve your issue with str I would probably do set operation outside function when you pass in the arguments. – Z Li Jan 26 '22 at 16:32
  • Thanks for the suggestion. How can I input a set of strings into a numba function then? I tried it but I get similar signature errors – Pau Jan 26 '22 at 16:34
  • probably convert it to a list `list(set(b))`, not sure about the performance though. You could check the numpy implementation, which is quite interesting. https://github.com/numpy/numpy/blob/4adc87dff15a247e417d50f10cc4def8e1c17a03/numpy/lib/arraysetops.py#L520 – Z Li Jan 26 '22 at 16:35
  • If *b* is not a set `a[i] in b` becomes linear. And will throw an error anyway. – Michael Szczesny Jan 26 '22 at 16:39
  • It's possible, even likely, that `numba` has not extended its `set` to work with strings. Strings, especially the default unicode, are more complicated than simple numbers, and may still depend heavily on Python's own object handling. – hpaulj Jan 26 '22 at 17:02
  • This might be worth raising on the [numba github page](https://github.com/numba/numba/issues) - users have reported similar issues with unicode strings in sets in the past – tgpz Jan 27 '22 at 00:06
  • The Numpy code is indeed interesting! I always thought that Numpy used a hash table internally (at least when the number of unique keys is small, since it would be inefficient otherwise). It surprisingly turns out not to be the case! This is certainly a missed optimization (I guess this could at least be done for simple types). – Jérôme Richard Jan 27 '22 at 03:13
  • @Pau: Did you ever get this to work? – Tarun Jul 20 '22 at 17:16
  • @Tarun yes kinda! Since there is no current support for strings, I transformed my string vector to a numeric one (basically their unique identifying index) to make `numba` work. Once my process has run I transform them back to strings outside of numba. Hope this is helpful! – Pau Jul 20 '22 at 17:27
  • @Pau: Thanks for the update. That approach makes sense. Also, this might be of interest for comparative benchmarking: https://stackoverflow.com/questions/53423924/why-is-python-in-much-faster-than-np-isin – Tarun Jul 21 '22 at 00:27
  • @Pau Unrelated to your question, but I just tested and you can use `out[i] = a[i] in b` without any performance hit. – Guimoute Oct 26 '22 at 16:36

1 Answer


Strings are barely supported by Numba (as are bytes, though their support is slightly better). Sets and dictionaries are supported only with strict restrictions and are still quite experimental. According to the documentation, sets of strings are not supported yet:

Sets must be strictly homogeneous: Numba will reject any set containing objects of different types, even if the types are compatible (for example, {1, 2.5} is rejected as it contains a int and a float). The use of reference counted types, e.g. strings, in sets is unsupported.

You can try to cheat using a binary search. Unfortunately, np.searchsorted is not yet implemented for string-typed Numpy arrays in Numba (although np.unique is). You could implement the binary search yourself, but this is cumbersome, and I am not sure it would be faster in the end. I think it should be, because of its O(Ns Na log Nb) running-time complexity (with Ns the average length of the unique strings of b, Na the number of items in a, and Nb the number of unique items in b). Indeed, the complexity of np.isin is O(Ns (Na+Nb) log (Na+Nb)) if the arrays have similar sizes, and O(Ns Na Nb) if Nb is much smaller than Na. Note that the best theoretical complexity is AFAIK O(Ns (Na + Nb)), thanks to a hash table with a good hash function (tries can also achieve this, but they should be slower in practice unless the hash function is poor).
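As a minimal sketch of the binary-search idea in plain Numpy (inside an njit function, the np.searchsorted call would have to be replaced by a hand-written loop, since it is not implemented for string arrays in Numba):

```python
import numpy as np

def isin_binary(a, b):
    # Sort the unique values of b once: O(Ns * Nb log Nb).
    b_sorted = np.sort(np.unique(b))
    # Binary-search each item of a in the sorted b: O(Ns * Na log Nb).
    idx = np.searchsorted(b_sorted, a)
    idx[idx == len(b_sorted)] = 0      # clamp out-of-range positions
    return b_sorted[idx] == a

a = np.array(['A', 'B', 'C', 'D'])
b = np.array(['B', 'D'])
print(isin_binary(a, b))  # [False  True False  True]
```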

Note that typed dictionaries support statically declared strings but not dynamic ones (and even static ones are an experimental feature).

Another cheat (that should work) is to store string hashes as keys of a typed dictionary, and to associate each hash with an array of indices referencing the locations in b of the strings having that hash. The parallel loop then hashes the items of a and fetches the locations of the b strings with the same hash, so the actual strings can be compared. A faster implementation is to assume that the hash function is perfect for the strings of b (i.e. no collision), so you can directly use a TypedDict[np.int64, np.int64] hash table; you can test this hypothesis at runtime while building the table for b. Writing such code is a bit tedious. Note that this implementation may still not be faster than Numpy when run sequentially, because Numba typed dictionaries are currently pretty slow... However, it should be faster in parallel on processors with enough cores.
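The dictionary-of-indices variant is fairly involved, but the perfect-hash shortcut can be sketched in plain Python/Numpy as follows. The fnv1a helper below is an illustrative stand-in (a standard FNV-1a 64-bit hash, simple enough to port into an njit loop), not Numba's own string hash:

```python
import numpy as np

def fnv1a(s):
    # FNV-1a 64-bit hash of a string; pure integer arithmetic,
    # so the same loop could run inside an njit function.
    h = 0xcbf29ce484222325
    for c in s:
        h = ((h ^ ord(c)) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def isin_hashed(a, b):
    b_hashes = set(fnv1a(s) for s in b)
    # Test the "perfect hash" hypothesis at runtime while building b's table:
    assert len(b_hashes) == len(set(b))   # no collision among b items
    return np.array([fnv1a(s) in b_hashes for s in a])

a = np.array(['A', 'B', 'C', 'D'])
b = np.array(['B', 'D'])
print(isin_hashed(a, b))  # [False  True False  True]
```

Note that a collision between an item of a and an item of b can still produce a false positive here; the full approach described above resolves that by comparing the actual strings at the colliding indices.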
