I want to append a row to a particular partition of a dask dataframe. I have tried several methods, but none of them worked. Can anyone help me with this? Thanks in advance.

I tried:

first_partition = df.partitions[0]
new_dd = first_partition.append(row)
df.partitions[0] = new_dd

This doesn't work.

I also tried map_partitions(), but it doesn't seem to expose the partition's metadata, so I couldn't use it to modify one particular partition.

Is it possible to save the dataframe as parquet, modify just one particular parquet file, and save it back? I tried this too, and it doesn't seem to work either.

Srimanth

1 Answer

You can modify that particular partition with map_partitions.

Then build a new dataframe with the modified partition swapped in: convert the dataframe to a list of delayed objects, replace the entry for that partition with the modified partition's delayed object, and convert the list back to a dask dataframe.


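import dask.dataframe as dd
import numpy as np
import pandas as pd

def append_row_dict(df, row_dict):
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    small_df = pd.DataFrame(row_dict)
    return pd.concat([df, small_df])

p_df = pd.DataFrame({'a': np.arange(0, 10)})

dask_df = dd.from_pandas(p_df, npartitions=4)
part_to_change = 1

# Build a one-partition dataframe that holds the modified partition
new_partition = dask_df.get_partition(part_to_change).map_partitions(append_row_dict, {'a': [-1]})
list_of_delayed = dask_df.to_delayed()

# get_partition returns a single partition, so there is exactly one delayed object
assert new_partition.npartitions == 1
list_of_delayed[part_to_change] = new_partition.to_delayed()[0]

# Reassemble the dataframe from the delayed partitions, reusing the original metadata
new_dask_df = dd.from_delayed(list_of_delayed, meta=dask_df._meta)
new_dask_df.get_partition(part_to_change).compute()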
    a
3   3
4   4
5   5
0   -1
Vibhu Jawa