I have a Pandas DataFrame / Polars DataFrame / PyArrow Table with a string key column. You can assume the strings are random. I want to partition it into N smaller dataframes based on this key column.
With an integer column, I can just use df1 = df[df.key % N == 1], df2 = df[df.key % N == 2], etc.
My best guess at how to do this with a string column is to apply a hash function (e.g. summing the ASCII values of the string) to convert it to an integer column, then take the modulus.
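To make the hash-then-modulus idea concrete, here is a minimal pandas sketch. It assumes pd.util.hash_pandas_object as the hash instead of an ASCII sum, and the column names, sample data and N = 4 are made up for illustration. It is only one possible approach, not necessarily the fastest.

```python
import pandas as pd

N = 4  # number of partitions (illustrative value)

# Hypothetical sample data; "key" is the string column to partition on.
df = pd.DataFrame({
    "key": ["apple", "banana", "cherry", "date", "elderberry", "fig"],
    "value": range(6),
})

# Vectorized hash of the string column; returns a uint64 Series.
# index=False hashes only the values, not the row index.
hashed = pd.util.hash_pandas_object(df["key"], index=False)

# Bucket id in [0, N) for each row.
bucket = (hashed % N).astype("int64")

# One sub-frame per bucket; a bucket with no matching rows is an empty frame.
parts = [df[bucket == i] for i in range(N)]
```

Every row lands in exactly one of the N frames, and equal keys always get the same bucket, which is all the partitioning needs.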
Please let me know the most efficient way to do this in Pandas, Polars or PyArrow, ideally with pure columnar operations within the API. Doing a df.apply is likely too slow for my use case.