
Assume a dask dataframe with X partitions, and a pandas dataframe with the same number X of rows. Each row of the pandas dataframe contains data relevant to the corresponding partition of the dask dataframe.

I would like to assign the values from each pandas dataframe row to a new column of the corresponding dask dataframe partition.

import pandas as pd
import dask
import numpy as np

# default dask dataframe with 30 partitions
ddf = dask.datasets.timeseries()

df0 = pd.DataFrame({'A': np.random.randint(0,100, size=30),
                    'B': np.random.randint(0,100, size=30)})

The very inefficient way to do this would be:

# collect each partition as its own single-partition dask dataframe
df_list = []
for n in range(ddf.npartitions):
    df_list.append(ddf.partitions[n])

# broadcast one scalar from df0 into a new column of each piece
for i, df in enumerate(df_list):
    df['A'] = df0['A'].iloc[i]

How can I achieve the same result but remain in dask? Maybe with `map_partitions`?

If it's not possible in dask, how can it be done more efficiently, avoiding loops?

Red Sparrow
  • An approach to do this (using `map_partitions`) is available in a newer SO answer. It uses `ddf.get_partition(...)` and then appends the extra row with `map_partitions`. See that answer [here](https://stackoverflow.com/a/65614536/4057186) for the implementation details. – edesz May 31 '21 at 18:05
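For reference, a minimal sketch of the `map_partitions` idea (an illustration, not the linked answer's code: it assumes a dask version that injects the `partition_info` keyword into mapped functions, and `assign_rows` is a hypothetical helper name):

import pandas as pd
import numpy as np
import dask

ddf = dask.datasets.timeseries()  # 30 partitions by default

df0 = pd.DataFrame({'A': np.random.randint(0, 100, size=ddf.npartitions),
                    'B': np.random.randint(0, 100, size=ddf.npartitions)})

def assign_rows(part, partition_info=None):
    # dask passes partition_info={'number': ..., 'division': ...} to
    # functions that accept this keyword; during meta inference it may
    # be None or a placeholder, so fall back to dummy values of the
    # same dtype in that case
    if partition_info is None:
        return part.assign(A=0, B=0)
    i = partition_info['number']
    return part.assign(A=df0['A'].iloc[i], B=df0['B'].iloc[i])

ddf2 = ddf.map_partitions(assign_rows)

The result stays a single lazy dask dataframe, so there is no round trip through a Python list of pieces.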

1 Answer


Your for loop only runs over the number of partitions, which is typically small (fewer than 10,000), so efficiency is unlikely to be a problem here.

MRocklin
  • Indeed, it is not that slow in the end like this. The issue, though, is that I then end up with a list of dataframes rather than a dask dataframe. Isn't there a way to do this directly in dask? – Red Sparrow Oct 21 '19 at 14:08
  • There is no Dask.dataframe operation to do what you want, but after you have your list of small dask dataframes you can then call `dd.concat(df_list, axis=0)` and get one dask dataframe again (a full sketch follows these comments). – MRocklin Oct 24 '19 at 01:07
  • Alright, it seems like there is no better way, at least for now. Thanks! – Red Sparrow Oct 25 '19 at 14:22
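Putting the answer and the follow-up comments together, a minimal end-to-end sketch (setup repeated from the question; the `dd.concat` call at the end is the stitch-up step suggested above):

import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd

ddf = dask.datasets.timeseries()  # 30 partitions by default

df0 = pd.DataFrame({'A': np.random.randint(0, 100, size=ddf.npartitions),
                    'B': np.random.randint(0, 100, size=ddf.npartitions)})

# broadcast each row of df0 into new columns of the matching partition,
# collecting a list of single-partition dask dataframes
df_list = []
for i in range(ddf.npartitions):
    part = ddf.partitions[i]
    df_list.append(part.assign(A=df0['A'].iloc[i], B=df0['B'].iloc[i]))

# stitch the pieces back into one dask dataframe
result = dd.concat(df_list, axis=0)

Everything here is lazy: the loop only builds the task graph, and nothing is computed until `result` is evaluated.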