
How do I compute the first discrete difference using Dask DataFrame? Or, in "Pandas speak", how do I do pandas.DataFrame.diff() in Dask? Mathematically, the operation is very simple: from each column, subtract a copy of itself shifted down by one or more rows.

I have tried implementing diff() in Dask in the following ways, none of which works (yet):

  • df - df.shift(periods=1) works in Pandas. But Dask DataFrame doesn't have a shift() method.
  • df.values[1:] - df.values[:-1] works in Pandas. But I can't see how to index into a Dask DataFrame by position.
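
For reference, here is a minimal pandas-only sketch of the two approaches above, checked against pandas' built-in diff(). The column name "x" and the sample values are made up for illustration. Note that the positional form that matches diff() subtracts the earlier rows from the later ones, i.e. values[1:] - values[:-1], and it has one row fewer than the input because there is no leading NaN row.

```python
import pandas as pd

# Hypothetical sample data, purely for illustration.
df = pd.DataFrame({"x": [1.0, 4.0, 9.0, 16.0]})

# Approach 1: subtract a copy shifted down by one row.
by_shift = df - df.shift(periods=1)

# Approach 2: positional slicing on the underlying NumPy array.
# This drops the first row instead of producing a NaN there.
by_position = df.values[1:] - df.values[:-1]

# The built-in method both approaches are meant to reproduce.
by_diff = df.diff()

print(by_shift)
print(by_position)
```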

My current best idea for implementing diff would be to wrap some custom code in dask.dataframe.rolling.wrap_rolling, as suggested in this Stack Overflow answer (although I haven't been able to find any documentation on how to do this). Alternatively, I could wrap some custom code using Dask Delayed. Any other thoughts?

Jack Kelly
  • Yup, I would recommend using wrap_rolling. If you [raise an issue](https://github.com/dask/dask/issues/new) to make this into user-accessible API I suspect that someone would take it on. (or maybe this is something that you would like to contribute to help others?) – MRocklin Nov 08 '16 at 13:16
  • @MRocklin thanks for the suggestion! I have just created [a feature request on Dask's issue queue](https://github.com/dask/dask/issues/1765). – Jack Kelly Nov 08 '16 at 13:46

1 Answer


The diff method has now been added to both DataFrame and Series, in this PR: https://github.com/dask/dask/pull/1769. It works the same as it does in pandas.

jiminy_crist