
As part of a data workflow, I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in two cases: mapping columns and mapping partitions. What is the recommended safe & performant way to act on the data? I'm running in a distributed setup on a cluster with multiple worker processes on each host.

Case 1.

I want to run:

res = dataframe.column.map(func, ...)

This returns a data series, so I assume that the original dataframe is not modified. Is it safe to assign the column back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I make a copy with .copy() and then assign the result to it, like:

dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)

Any other recommended way to do it?

Case 2.

I need to map partitions of the dataframe:

df.map_partitions(mapping_func, meta=df)

Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply by using a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?

The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but while modifying its data in-place I'm seeing some crashes (with no exception/error messages, though). The same goes for calling partition.copy(deep=False); it doesn't work. Should the partition be deep-copied and then modified in-place? Or should I always construct a new dataframe out of the new/mapped column data plus the original/unmodified series/columns?

evilkonrex

1 Answer


You can safely modify a dask.dataframe

Operations like the following are supported and safe

df['col'] = df['col'].map(func)

This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).

You can not safely modify a partition

Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function then you should create a copy of the pandas dataframe first, within that function.

MRocklin
  • Is this still true? If yes, what is the suggested best practice of doing mutations, which if done using pandas could have been performed in-place, **without** doubling the memory consumption per-partition? – stav Dec 08 '20 at 20:12