I'm using dask to delay computation of some functions that return series in my code-base. Most operations seem to behave as expected so far - apart from my use of np.average.

The function I have returns a pd.Series, on which I then want to compute a weighted average.

Below are a non-dask version and a dask version:

import dask
import numpy as np
import pandas as pd

s = pd.Series([1,2,3])
a = np.average(s, weights=s)
print(a)

# dask version: ds is a Delayed object wrapping the same series
ds = dask.delayed(lambda: s)()
a = np.average(ds, weights=ds)  # this line raises the TypeError
print(a.compute())

The np.average call raises a TypeError: Truth of Delayed objects is not supported.

Unsure what part of my usage is wrong here.

freebie

1 Answer

The problem is that you are calling the NumPy function np.average on a Dask Delayed object. The NumPy function has no idea what to do with a Delayed object, so it raises an error. The solution is to delay the NumPy function as well.

You can do the following:

a = dask.delayed(np.average)(ds, weights=ds)
a.compute()

This works (you get the answer), but it may well not be what you were after. np.average is called once on the whole series: you do get lazy evaluation, and you may get parallelism if you have many such computations. However, I'd say it is pretty unusual to pass around delayed pandas series like this.
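To illustrate the "many such computations" case, here is a minimal sketch. The helper `make_series` and the sample values are hypothetical stand-ins for whatever functions return series in your code base; only `dask.delayed`, `dask.compute`, and `np.average` are real library calls.

```python
import dask
import numpy as np
import pandas as pd


def make_series(values):
    # Hypothetical stand-in for a function that returns a pd.Series.
    return pd.Series(values)


# Sample inputs, invented for illustration.
inputs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Each call is deferred; nothing is computed yet.
delayed_series = [dask.delayed(make_series)(values) for values in inputs]

# Delay np.average too, so it only sees concrete series at compute time.
delayed_averages = [
    dask.delayed(np.average)(ds, weights=ds) for ds in delayed_series
]

# A single compute call evaluates the whole batch, which lets the
# scheduler run the independent tasks in parallel.
averages = dask.compute(*delayed_averages)
print(averages)
```

With only a handful of small series the scheduling overhead probably outweighs any gain, so this pattern pays off mainly when there are many independent delayed calls.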

You may want to read up on the high-level array and dataframe interfaces (dask.array and dask.dataframe), where the logic of splitting up series and arrays is done for you.
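As a rough sketch of what the array interface looks like for this case, the same weighted average can be written against a dask array rather than a Delayed object. The chunk size of 2 is an arbitrary choice for illustration, and the data is the series from the question converted to a NumPy array.

```python
import dask.array as da
import numpy as np

values = np.array([1, 2, 3])

# Wrap the data as a dask array; chunks=2 is arbitrary and only serves
# to split the data into more than one block.
x = da.from_array(values, chunks=2)

# da.average builds a lazy task graph over the chunks.
avg = da.average(x, weights=x)

result = avg.compute()
print(result)
```

Note that `da.average` expects dask arrays (or array-likes); handing it a Delayed object is a different situation from the one shown here.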

MRocklin
mdurant
  • FWIW I pass around delayed pandas series very often. I think that this is a natural way of using dask and pandas together. Of course, as you say, you'll have to have many delayed calls running at the same time to make it worthwhile. – MRocklin Oct 04 '18 at 15:50
  • Thanks both. On the idea of numpy not knowing about dask: my experience has been that it can indeed work, so `np.sum().compute()` will work. I expected the same with np.average. I also tried the dask implementation `da.average` directly, with the same error. In this case the delayed series are not very large, so I do not currently need to leverage dask there, but there are many of the delayed series, so I am leveraging dask on that axis. I've currently solved this by just writing my own simple weighted average using the more primitive *, /, and `np.sum`. – freebie Oct 05 '18 at 11:50
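
The workaround described in the last comment might look something like the sketch below. The function name `weighted_average` is hypothetical, and it uses the `.sum()` method rather than `np.sum`, which is an assumption made here because arithmetic operators and method calls on a Delayed object are recorded lazily.

```python
import dask
import pandas as pd


def weighted_average(values, weights):
    # Only uses *, / and .sum(), all of which a Delayed object defers,
    # so no truth test is ever performed on a Delayed value.
    return (values * weights).sum() / weights.sum()


s = pd.Series([1, 2, 3])
ds = dask.delayed(lambda: s)()

# Works on a plain series...
eager = weighted_average(s, s)

# ...and on a Delayed series, where it returns another Delayed.
lazy = weighted_average(ds, ds)
result = lazy.compute()
print(eager, result)
```

The same function works for both eager and delayed inputs because it never needs to inspect the actual values while the graph is being built.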