
I have a huge Dask Dataframe similar to this

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | AE | ... | time |
| 2   | val2 | FB | ... | time |
| ... | ...  | .. | ... | ...  |
| n-1 | valx | ZK | ... | time |
| n   | valn | QK | ... | time |

and I want to repartition it based on unique values of the C2 column and map a function to each partition.

First, I set C2 as my index:

    import dask.dataframe as dd

    df = dd.read_csv(...)
    df = df.set_index(df.C2)

And now I want to repartition the newly indexed dataframe and map a function to each partition. My current approach looks like this:

    unique_c2 = df.index.unique().compute()

    # list(unique_c2) looks like this: ['AE', 'FB', ..., 'ZK', 'QK']
    df = df.repartition(divisions=list(unique_c2))

    df = df.map_partitions(my_func, meta=df)

My desired partitioning should look like this:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| AE  | val1 | AE | ... | time |
| AE  | val2 | AE | ... | time |
| ... | ...  | .. | ... | ...  |
| AE  | valn | AE | ... | time |

...

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| ZK  | val1 | ZK | ... | time |
| ZK  | val2 | ZK | ... | time |
| ... | ...  | .. | ... | ...  |
| ZK  | valn | ZK | ... | time |

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| QK  | val1 | QK | ... | time |
| QK  | val2 | QK | ... | time |
| ... | ...  | .. | ... | ...  |
| QK  | valn | QK | ... | time |

But the repartition call "merges" my last two index values, so my last partition looks like this:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| ZK  | val1 | ZK | ... | time |
| ZK  | val2 | ZK | ... | time |
| ... | ...  | .. | ... | ...  |
| QK  | valn | QK | ... | time |
| ... | ...  | .. | ... | ...  |
| QK  | valn | QK | ... | time |

Any ideas why this happens, or do you have a better solution for my problem? I know that there is df.groupby(...).apply(...), but my mapping function has side effects, and the function passed to apply(...) is executed twice for each partition by design.

pichlbaer

1 Answer


The number of divisions is always the number of partitions plus one; that is how divisions are defined. From the docs:

> Divisions includes the minimum value of every partition's index and the maximum value of the last partition's index.

Because you set divisions=list(unique_c2), you get exactly as many divisions as there are unique C2 values, so you end up with one partition fewer than you want: the last division interval covers the last two unique values.
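To see the off-by-one concretely, here is a minimal sketch with made-up toy data: three unique index values, used directly as divisions, yield only two partitions.

    import pandas as pd
    import dask.dataframe as dd

    # Toy frame with three unique index values: 'A', 'B', 'Z'.
    pdf = pd.DataFrame({'C1': range(6)}, index=list('AABBZZ'))
    pdf.index.name = 'C2'
    ddf = dd.from_pandas(pdf, npartitions=1)

    ddf = ddf.repartition(divisions=['A', 'B', 'Z'])
    print(ddf.npartitions)  # 2, not 3: 'B' and 'Z' share the last partition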

You can fix this by changing the code to:

    unique_c2_list = sorted(df.index.unique().compute())

    df = df.repartition(divisions=unique_c2_list + [unique_c2_list[-1]])

Sorting first guarantees the list is in division order, and appending the largest c2 value once more makes the minimum and maximum of the final division identical. This produces one partition per unique c2 value and prevents the last two from being merged.
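Putting it together, a self-contained sketch of the whole workflow, with toy data standing in for the real CSV and a no-op stand-in for my_func:

    import pandas as pd
    import dask.dataframe as dd

    def my_func(part):
        # Stand-in for the side-effecting mapping function; returns the partition unchanged.
        return part

    # Toy stand-in for the real CSV data.
    pdf = pd.DataFrame({'C1': ['val1', 'val2', 'val3', 'val4'],
                        'C2': ['AE', 'FB', 'QK', 'ZK']})
    df = dd.from_pandas(pdf, npartitions=2).set_index('C2')

    unique_c2_list = sorted(df.index.unique().compute())
    df = df.repartition(divisions=unique_c2_list + [unique_c2_list[-1]])

    print(df.npartitions)  # 4: one partition per unique C2 value
    result = df.map_partitions(my_func, meta=df)
    result.compute()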

elukem