I have a Dask DataFrame with one string column and three other columns whose values are NumPy arrays, and I compute a combined hash per row as follows:
import numpy as np


def calculate_hash_chunk(chunk, modalities):
    num_modalities = len(modalities)

    def _calculate_hash(row):
        # Called once per row (axis=1); each row[modality] is a NumPy array.
        if num_modalities == 1:
            return row[modalities[0]]
        if num_modalities == 2:
            hash_0 = row[modalities[0]]
            hash_1 = row[modalities[1]]
            return np.bitwise_xor(hash_0, hash_1)
        if num_modalities == 3:
            hash_0 = row[modalities[0]]
            hash_1 = row[modalities[1]]
            hash_2 = row[modalities[2]]
            # Note: a ufunc's third positional argument is `out`, so this
            # returns (hash_0 ^ hash_1) & (hash_0 ^ hash_2); the third XOR
            # array only serves as the output buffer.
            return np.bitwise_and(
                np.bitwise_xor(hash_0, hash_1),
                np.bitwise_xor(hash_0, hash_2),
                np.bitwise_xor(hash_1, hash_2))

    return chunk.assign(hash_result=chunk.apply(_calculate_hash, axis=1))

def calculate_hash_df(df, modalities, meta):
    hash_result = {"hash_result": object}
    meta.update(hash_result)
    return df.map_partitions(lambda partition: calculate_hash_chunk(partition, modalities), meta=meta)
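
As a side check on what the three-modality branch computes for a single row, here is a small NumPy-only sketch. The 4-bit values and variable names are made up for illustration and are not taken from my data. It relies on the fact that a NumPy ufunc treats a third positional array as its `out` buffer.

```python
import numpy as np

# Made-up per-row hashes for three modalities (illustrative values only).
h0 = np.array([0b1100], dtype=np.uint8)
h1 = np.array([0b1010], dtype=np.uint8)
h2 = np.array([0b0110], dtype=np.uint8)

xor_01 = np.bitwise_xor(h0, h1)  # 0b0110
xor_02 = np.bitwise_xor(h0, h2)  # 0b1010
xor_12 = np.bitwise_xor(h1, h2)  # 0b1100

# AND of all three XORs, computed before xor_12 is overwritten below.
three_way = xor_01 & xor_02 & xor_12

# The third positional argument is `out`: the AND of the first two
# arrays is written into xor_12, and that same array is returned.
result = np.bitwise_and(xor_01, xor_02, xor_12)

print(result is xor_12)  # True
print(result)            # [2], i.e. 0b0010
print(three_way)         # [0]
```

For single bits, three values can never be pairwise different, so the AND of all three XORs is zero for any input, while the call as written keeps the bits where the first modality differs from both of the others.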
Calling calculate_hash_df works as expected. However, when I add Numba's njit decorator, I get the following error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at /tmp/ipykernel_20351/1214465196.py (5)
File "../../../../tmp/ipykernel_20351/1214465196.py", line 5:
<source missing, REPL/exec in use?>
This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'pandas.core.series.Series'>
The error says that Numba cannot determine a type for a pandas Series, so it cannot compile a function that receives one. Is there a way to restructure this so that I get Numba's speedup while still using Dask and pandas, and have the best of both worlds?
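
One direction I am considering, sketched under assumptions rather than as a confirmed fix: since Numba understands NumPy arrays but not pandas objects, the jitted function could take plain 2-D integer arrays, with each column converted outside it (for example with `np.stack(chunk[col].to_numpy())`, assuming every row holds an array of the same length and integer dtype). The name `xor_hashes` and the sample values are hypothetical, only the two-modality XOR case is shown, and the `try`/`except` is there only so the sketch runs without Numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (unjitted) when Numba is missing.
    def njit(func):
        return func


@njit
def xor_hashes(h0, h1):
    """Element-wise XOR of two 2-D integer arrays (rows x hash length)."""
    out = np.empty_like(h0)
    for i in range(h0.shape[0]):
        for j in range(h0.shape[1]):
            out[i, j] = h0[i, j] ^ h1[i, j]
    return out


# Hypothetical stand-ins for two stacked modality columns, one row per record.
h0 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.uint8)
h1 = np.array([[3, 2, 1], [6, 5, 4]], dtype=np.uint8)

combined = xor_hashes(h0, h1)
```

The 2-D result could then be split back into one array per row, for example with `list(combined)`, and assigned to `hash_result` inside `calculate_hash_chunk`.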
Thanks