
Assume a dask dataframe with X partitions, and a pandas dataframe with the same number X of rows. Each row of the pandas dataframe contains data relevant to the corresponding partition of the dask dataframe.

I would like to assign the values from each pandas dataframe row to a new column of the corresponding dask dataframe partition.

import pandas as pd
import dask
import numpy as np

# default dask dataframe with 30 partitions
ddf = dask.datasets.timeseries()

df0 = pd.DataFrame({'A': np.random.randint(0,100, size=30),
                    'B': np.random.randint(0,100, size=30)})

The very inefficient way to do this would be:

# collect each partition as its own single-partition dask dataframe
df_list = []
for n in range(ddf.npartitions):
    df_list.append(ddf.partitions[n])

# broadcast one scalar from df0 into a new column of each piece
for i, df in enumerate(df_list):
    df['A'] = df0['A'].iloc[i]

How can I achieve the same result but remain in dask? Maybe with `map_partitions`?

If it's not possible in dask, how can it be done more efficiently, avoiding loops?

Red Sparrow
  • An approach to do this (using `map_partitions`) is available in a newer SO answer. It uses `ddf.get_partition(...)` and then appends the extra row with `map_partitions`. See that answer [here](https://stackoverflow.com/a/65614536/4057186) for the implementation details. – edesz May 31 '21 at 18:05
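For reference, a minimal sketch of the `map_partitions` idea (an illustration, not the linked answer's code: it assumes a dask version that injects the `partition_info` keyword into mapped functions, and `assign_rows` is a hypothetical helper name):

import pandas as pd
import numpy as np
import dask

ddf = dask.datasets.timeseries()  # 30 partitions by default

df0 = pd.DataFrame({'A': np.random.randint(0, 100, size=ddf.npartitions),
                    'B': np.random.randint(0, 100, size=ddf.npartitions)})

def assign_rows(part, partition_info=None):
    # dask passes partition_info={'number': ..., 'division': ...} to
    # functions that accept this keyword; during meta inference it may
    # be None or a placeholder, so fall back to dummy values of the
    # same dtype in that case
    if partition_info is None:
        return part.assign(A=0, B=0)
    i = partition_info['number']
    return part.assign(A=df0['A'].iloc[i], B=df0['B'].iloc[i])

ddf2 = ddf.map_partitions(assign_rows)

The result stays a single lazy dask dataframe, so there is no round trip through a Python list of pieces.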

1 Answer


Your for loop only runs over the number of partitions, which is typically small (fewer than 10,000), so efficiency is unlikely to be a problem here.

MRocklin
  • Indeed, it is not that slow in the end like this. The issue, though, is that I then end up with a list of dataframes rather than a dask dataframe. Isn't there a way to do this directly in dask? – Red Sparrow Oct 21 '19 at 14:08
  • There is no Dask.dataframe operation to do what you want, but after you have your list of small dask dataframes you can then call `dd.concat(df_list, axis=0)` and get one dask dataframe again (a full sketch follows these comments). – MRocklin Oct 24 '19 at 01:07
  • Alright, it seems like there is no better way, at least for now. Thanks! – Red Sparrow Oct 25 '19 at 14:22
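Putting the answer and the follow-up comments together, a minimal end-to-end sketch (setup repeated from the question; the `dd.concat` call at the end is the stitch-up step suggested above):

import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd

ddf = dask.datasets.timeseries()  # 30 partitions by default

df0 = pd.DataFrame({'A': np.random.randint(0, 100, size=ddf.npartitions),
                    'B': np.random.randint(0, 100, size=ddf.npartitions)})

# broadcast each row of df0 into new columns of the matching partition,
# collecting a list of single-partition dask dataframes
df_list = []
for i in range(ddf.npartitions):
    part = ddf.partitions[i]
    df_list.append(part.assign(A=df0['A'].iloc[i], B=df0['B'].iloc[i]))

# stitch the pieces back into one dask dataframe
result = dd.concat(df_list, axis=0)

Everything here is lazy: the loop only builds the task graph, and nothing is computed until `result` is evaluated.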