I have a huge Dask Dataframe similar to this
| Ind | C1   | C2 | .... | Cn   |
|-----|------|----|------|------|
| 1   | val1 | AE | .... | time |
| 2   | val2 | FB | .... | time |
| ... | .... | .. | .... | ...  |
| n-1 | valx | ZK | .... | time |
| n   | valn | QK | .... | time |
and I want to repartition it based on unique values of the C2 column and map a function to each partition.
First, I set C2 as the index:
    import dask.dataframe as dd

    df = dd.read_csv(...)
    df = df.set_index('C2')
Now I want to repartition the newly indexed dataframe and map a function to each partition. My current approach looks like this:
    unique_c2 = df.index.unique().compute()
    df = df.repartition(divisions=list(unique_c2))
    # list(unique_c2) looks like this: ['AE', 'FB', ..., 'ZK', 'QK']
    df.map_partitions(my_func, meta=df)
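Put together, a minimal self-contained sketch of what I am running looks like this (the CSV path and the body of my_func are just placeholders; my real function has side effects):

    import dask.dataframe as dd

    def my_func(partition):
        # placeholder for my real mapping function, which has side effects
        if len(partition):
            print(partition.index[0], len(partition))
        return partition

    df = dd.read_csv('data/*.csv')              # hypothetical input path
    df = df.set_index('C2')                     # C2 becomes the index

    unique_c2 = df.index.unique().compute()     # e.g. ['AE', 'FB', ..., 'ZK', 'QK']
    df = df.repartition(divisions=list(unique_c2))
    result = df.map_partitions(my_func, meta=df)
    result.compute()                            # trigger the computation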
My desired partitioning should look like this:
| Ind | C1   | C2 | .... | Cn   |
|-----|------|----|------|------|
| AE  | val1 | AE | .... | time |
| AE  | val2 | AE | .... | time |
| ... | .... | .. | .... | ...  |
| AE  | valn | AE | .... | time |

...

| Ind | C1   | C2 | .... | Cn   |
|-----|------|----|------|------|
| ZK  | val1 | ZK | .... | time |
| ZK  | val2 | ZK | .... | time |
| ... | .... | .. | .... | ...  |
| ZK  | valn | ZK | .... | time |

| Ind | C1   | C2 | .... | Cn   |
|-----|------|----|------|------|
| QK  | val1 | QK | .... | time |
| QK  | val2 | QK | .... | time |
| ... | .... | .. | .... | ...  |
| QK  | valn | QK | .... | time |
But the repartition call "merges" my last two index values, so the last partition contains both 'ZK' and 'QK' rows and looks like this:
| Ind | C1   | C2 | .... | Cn   |
|-----|------|----|------|------|
| ZK  | val1 | ZK | .... | time |
| ZK  | val2 | ZK | .... | time |
| ... | .... | .. | .... | ...  |
| ZK  | valn | ZK | .... | time |
| QK  | val1 | QK | .... | time |
| ... | .... | .. | .... | ...  |
| QK  | valn | QK | .... | time |
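As a check, I also inspect the partition boundaries after the repartition (npartitions, divisions and get_partition are standard Dask attributes/methods; the concrete values are just what I expect from the listing above):

    # Dask keeps len(divisions) == npartitions + 1, so k unique index values
    # passed as divisions give only k - 1 partitions.
    print(df.npartitions)    # one less than the number of unique C2 values
    print(df.divisions)      # ('AE', 'FB', ..., 'ZK', 'QK')
    # The last partition then spans both 'ZK' and 'QK':
    print(df.get_partition(df.npartitions - 1).index.unique().compute())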
Any ideas why this happens, or is there a better solution for my problem? I know there is df.groupby(...).apply(...), but my mapping function has side effects, and apply(...) is always executed twice for each Dask partition by design.
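For completeness, the groupby-based version I am trying to avoid would look roughly like this (same placeholder path and my_func as above); as far as I understand, when no meta is given Dask additionally runs the function on a small dummy frame to infer the output schema, which is exactly the extra execution I cannot afford with a side-effecting function:

    import dask.dataframe as dd

    def my_func(group):
        # placeholder for my real function with side effects
        if len(group):
            print(group['C2'].iloc[0], len(group))
        return group

    df = dd.read_csv('data/*.csv')   # hypothetical input path
    # Without an explicit meta=..., Dask warns that it has to guess the
    # output schema by running my_func on a dummy frame first.
    out = df.groupby('C2').apply(my_func)
    out.compute()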