
As part of a data workflow, I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in two cases: mapping columns and mapping partitions. What is the recommended safe & performant way to act on the data? I'm running in a distributed setup on a cluster with multiple worker processes on each host.

Case 1.

I want to run:

res = dataframe.column.map(func, ...)

This returns a data series, so I assume that the original dataframe is not modified. Is it safe to assign the column back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I make a copy with .copy() and then assign the result to it, like:

dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)

Any other recommended way to do it?

Case 2.

I need to map partitions of the dataframe:

df.map_partitions(mapping_func, meta=df)

Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply by using a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?

The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but while modifying its data in-place I'm seeing some crashes (with no exception/error messages, though). The same goes for calling partition.copy(deep=False); it doesn't work. Should the partition be deep-copied and then modified in-place? Or should I always construct a new dataframe out of the new/mapped column data plus the original/unmodified series/columns?

evilkonrex

1 Answer


You can safely modify a dask.dataframe

Operations like the following are supported and safe

df['col'] = df['col'].map(func)

This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).

You can not safely modify a partition

Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function then you should create a copy of the pandas dataframe first, within that function.

MRocklin
  • Is this still true? If yes, what is the suggested best practice of doing mutations, which if done using pandas could have been performed in-place, **without** doubling the memory consumption per-partition? – stav Dec 08 '20 at 20:12