I have a dataframe with one string column and three other columns whose values are NumPy arrays, and I do the following:

import numpy as np


def calculate_hash_chunk(chunk, modalities):
    num_modalities = len(modalities)
    
    def _calculate_hash(chunk):
        if num_modalities == 1:
            return chunk[modalities[0]]
        if num_modalities == 2:
            hash_0 = chunk[modalities[0]]
            hash_1 = chunk[modalities[1]]
            return np.bitwise_xor(hash_0, hash_1)
        if num_modalities == 3:
            hash_0 = chunk[modalities[0]]
            hash_1 = chunk[modalities[1]]
            hash_2 = chunk[modalities[2]]
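            # np.bitwise_and takes two operands (a third positional
            # argument is treated as `out`), so the calls are nested.
            return np.bitwise_and(
                        np.bitwise_and(
                            np.bitwise_xor(hash_0, hash_1),
                            np.bitwise_xor(hash_0, hash_2)),
                        np.bitwise_xor(hash_1, hash_2))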
    return chunk.assign(hash_result=chunk.apply(_calculate_hash, axis=1))


def calculate_hash_df(df, modalities, meta):
    hash_result = {"hash_result": object}
    meta.update(hash_result)
    return df.map_partitions(lambda partition: calculate_hash_chunk(partition, modalities), meta=meta)

This works as expected. However, when I try to apply Numba's njit decorator, I get the following error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at /tmp/ipykernel_20351/1214465196.py (5)

File "../../../../tmp/ipykernel_20351/1214465196.py", line 5:
<source missing, REPL/exec in use?> 

This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'pandas.core.series.Series'>

As I understand it, this means Numba cannot operate on a pandas Series. Is there any way to make this work, so that I get the performance gain and the best of both worlds?

Thanks

Norhther
  • Sounds like your `df` gets partitioned into a Series object, which gets passed to the Numba function. But Numba won't work with such high level objects. You should probably decorate a lower level function, like the one passed to `apply()`. See the Pandas documentation for some examples on how to use Numba: https://pandas.pydata.org/docs/user_guide/enhancingperf.html#numba-jit-compilation – Rutger Kassies Apr 26 '23 at 13:49
  • To be sure: it's the `_calculate_hash` that you want to decorate with JIT? Does it work with pandas (not dask) on part of the data? Have you tried defining the function standalone rather than dynamic? – mdurant Apr 26 '23 at 15:16

1 Answer

Numba's nopython mode cannot type pandas objects such as Series; it works on NumPy arrays. If a pandas.Series is backed by NumPy (as opposed to PyArrow), you can unwrap it before the call and rewrap the result:

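import numpy
import pandas
from numba import njit

@njit
def internal(a: numpy.ndarray) -> numpy.ndarray:
    ...

def external(s: pandas.Series):
    # Plain-Python wrapper: unwrap the Series, call the compiled function.
    # Option A, if internal reduces to a scalar:
    return internal(s.values)
    # Option B, if internal is a 1:1 map (use this return instead of option A):
    # return pandas.Series(internal(s.values), index=s.index, name=s.name)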
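
Applied to the question's data, one complication is that each cell holds its own array, so `s.values` is an object-dtype array that Numba cannot type either. A minimal sketch of one way around that, assuming every array within a column has the same length and an integer dtype: stack each column into a 2-D array in plain Python, then hand only those arrays to the compiled function. The name `combine_hashes` is hypothetical, only the three-modality case is shown, and the fallback `njit` exists only so the sketch runs where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speed-up then).
    def njit(func):
        return func


@njit
def combine_hashes(h0, h1, h2):
    # Element-wise on 2-D integer arrays: AND of the three pairwise XORs.
    return (h0 ^ h1) & (h0 ^ h2) & (h1 ^ h2)


def calculate_hash_chunk(chunk, modalities):
    # Assumption: exactly three modalities, non-empty chunk, and equal-length
    # integer arrays in every row of a column.
    # Turn each object column of per-row arrays into one 2-D array.
    h0, h1, h2 = (np.stack(chunk[m].tolist()) for m in modalities)
    result = combine_hashes(h0, h1, h2)
    # Store one 1-D array per row again, aligned with the chunk's index.
    hashes = pd.Series(list(result), index=chunk.index)
    return chunk.assign(hash_result=hashes)
```

This version can be passed to `df.map_partitions` in the same way as the original `calculate_hash_chunk`. Since the operations are element-wise bitwise ufuncs, most of the gain likely comes from dropping the row-wise `apply` rather than from Numba itself, so it is worth timing the function with and without the decorator.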
crusaderky