7

Given a pandas DataFrame, one can copy it before doing anything via:

df.copy()

How can I do this with a dask dataframe object?

cel
Michael

4 Answers

9

Mutation on dask.dataframe objects is rare, so this is rarely necessary.

That being said, you can safely just copy the object:

from copy import copy
df2 = copy(df)

No dask.dataframe operation mutates any of the fields of the dataframe, so this is sufficient.

MRocklin
1

Dask creates internal pipelines of lazy computations. Every version of your dataframe is another layer of computations which are not computed until later.

You can branch from these computations by copying the dataframe, as @MRocklin suggests, in which case you are working on a brand new stack of computations. Alternatively, you can continue on the same stack by doing:

df = df[df.columns]
André C. Andersen
1

It is possible you want to have two versions of your data, one before and one after a mutation. There is a copy method on dask dataframes you can use; it likely does the same as Python's copy.copy, but it feels safer (to me) to use the library maintainer's version.

import pandas as pd
import dask.dataframe as dd
ddf = dd.from_pandas(pd.DataFrame({'z': [1, 2]}), npartitions=1)
ddf2 = ddf.copy()
ddf2['z'] -= 10

print(ddf.compute())
print()
print(ddf2.compute())
   z
0  1
1  2

   z
0 -9
1 -8
HoosierDaddy
-3

Write it to a file and read again:

import os
import dask.dataframe as dd

df = <Initial Dask Dataframe to be copied>
file = 'sample.csv'
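# single_file=True writes one CSV instead of one file per partition;
# index=False avoids reading the index back as an extra column
df.to_csv(file, single_file=True, index=False)
# read_csv lives on the dask.dataframe module, not on the dataframe
# dd.read_csv is lazy, so load the data before deleting the file
# (assumes the default local scheduler, where persist() blocks)
df2 = dd.read_csv(file).persist()
os.remove(file)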
Gaurav Dhama