
I have two Dask DataFrames with the same number of partitions.

The first one has a few columns and a few rows in each partition (each partition is a pandas DataFrame); the number of rows can differ between partitions, but the columns are the same. The second Dask DataFrame has exactly the same number of partitions, but each partition holds only one value. I'm trying to merge these two Dask DataFrames partition by partition, like below.

Dask DataFrame N°1 (toy example):

[image]

Dask DataFrame N°2 (with only 1 value per partition and so only 1 column):

[image]

And my aim is to obtain a Dask Dataframe like this :

[image]

I tried `assign()` with `map_partitions()`, but without success.

Do you have an idea of how to do this efficiently?

Regards.

rmarion37


2 Answers


If you know for sure that the number of partitions and their order are the same for two dataframes, it's possible to create a custom procedure that processes these partitions in pairs:

from dask import delayed
from dask.dataframe import from_delayed

@delayed
def process_partitions(partition1, partition2):
    # each partition arrives here as a plain pandas DataFrame; for example:
    merged = partition1.merge(partition2)
    return merged

# assume ddf1, ddf2 have the same number/order of partitions
delayeds = []
for partition1, partition2 in zip(ddf1.partitions, ddf2.partitions):
    delayeds.append(process_partitions(partition1, partition2))

ddf3 = from_delayed(delayeds)
SultanOrazbayev
  • Thanks, I tried: ``` @dask.delayed def process_partitions(partition1, partition2): partition1.loc[:, 'z'] = partition2; return partition1 ``` with the same `zip` loop and `from_delayed(delayeds).compute()`. The result is: `x y z / 0 1 1.0 10.0 / 1 2 2.0 NaN / 2 3 3.0 NaN / 3 4 4.0 NaN / 4 5 5.0 NaN`. Only the first line of the first partition gets the value. Regards. Rémy – rmarion37 Mar 17 '23 at 16:03
  • Hello, Remy, to me it looks like your `process_partitions` function is not defined correctly. Test it with pandas to get the desired behaviour. – SultanOrazbayev Mar 17 '23 at 16:06
  • Yes, for sure it's a problem with the `process_partitions` function. I understand that Dask partitions are individual pandas DataFrames, but even a solution that works with pandas doesn't work on the Dask partitions. I tested `insert` and `assign` without success. Very strange... – rmarion37 Mar 17 '23 at 17:38
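The NaN behaviour described in the comments is most likely ordinary pandas index alignment: assigning a Series (or DataFrame) to a column aligns on the index labels, so only matching labels receive values. A minimal pandas-only sketch (toy values are assumptions):

```python
import pandas as pd

part1 = pd.DataFrame({"x": [1, 2, 3]})   # index labels 0, 1, 2
part2 = pd.DataFrame({"z": [10]})        # single value, index label 0

# Assigning a Series aligns on the index: only label 0 matches,
# so rows 1 and 2 receive NaN, matching the result in the comment above.
part1.loc[:, "z"] = part2["z"]
aligned = part1["z"].isna().tolist()     # [False, True, True]

# Extracting the scalar first broadcasts it to every row instead.
part1.loc[:, "z"] = part2["z"].iloc[0]
```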

Thanks for your answer.

To fill all the rows of each partition, the key point in the delayed function is to force the value to a scalar (an integer in my example). It is a pandas subtlety that is not very clear in the documentation, but real!

import dask
from dask.dataframe import from_delayed

@dask.delayed
def process_partitions(partition1, partition2):
    partition1.loc[:, 'id_cycle'] = int(partition2)
    return partition1

delayeds = []
for partition1, partition2 in zip(ddf1.partitions, ddf2.partitions):
    delayeds.append(process_partitions(partition1, partition2))

ddf3 = from_delayed(delayeds)
ddf = ddf3.compute()
rmarion37