4

What're the logistics behind having the extra .compute() in the numpy and pandas mimicked functionalities? Is it just to support some kind of lazy evaluation?

Example from Dask documentation below:

import pandas as pd                     import dask.dataframe as dd
df = pd.read_csv('2015-01-01.csv')      df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean()     df.groupby(df.user_id).value.mean().compute()
Ryan McCormick
  • 246
  • 4
  • 14

1 Answers1

1

Yes, your intution is correct here. Most Dask collections (array, bag, dataframe, delayed) are lazy by default. Normal operations are lazy while calling compute actually triggers execution.

This is important both so that we can do bit of light optimization, and also so that we can support low-memory execution. For example, if you were to call

x = da.ones(1000000000000)
total = x.sum()

And if we ran immediately then there would be a time where we thought that you wanted the full array computed, which would be unfortunate if you were on a single machine. But if we know that you only want total.compute() then we can compute this thing in much smaller memory.

MRocklin
  • 55,641
  • 23
  • 163
  • 235
  • So is `compute()` a way of catching that it's actually simple to get the sum of this array and unnecessary to store the array in memory? And with this in mind it just uses a generator to compute the sum or something? @MRocklin – Ryan McCormick Jan 31 '19 at 02:12
  • Yes, something like that, but replace "generator" with a more complex system that can handle arbitrary task graphs :) – MRocklin Jan 31 '19 at 02:50