
I am looking for the best way to compute many dask delayed objects stored in a pandas dataframe. I am unsure whether the pandas dataframe should be converted to a dask dataframe with the delayed objects inside, or whether compute should be called on every value of the pandas dataframe individually.

I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed objects across nested for loops.

import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
import dask.dataframe as dd

steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()

# Build a dataframe where every cell is a delayed hypergeom.sf call
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr

I cannot call compute on this dataframe directly; the closest I have found is to apply it across all cells, like so: enr_df.applymap(compute), which I believe calls compute on each value individually.

However, if I convert it to a dask dataframe, the delayed objects I want to compute end up nested inside the dask dataframe structure:

enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()

The computation I expect does not take place.

skurp

1 Answer


You can pass a list of delayed objects into dask.compute:

results = dask.compute(*list_of_delayed_objects)

So you need to get a list from your Pandas dataframe. This is something you can do with normal Python code.
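For example, here is a minimal sketch along those lines (assuming the enr_df built in the question; results and computed_df are just illustrative names): flatten the dataframe's cells into a flat sequence, compute everything in one call, and reshape the results back into a dataframe of the same shape.

import dask
import numpy as np
import pandas as pd

# Flatten the cells of the dataframe of delayed objects into a flat array
# (row-major order, numpy's default)
flat = enr_df.to_numpy().ravel()

# A single compute call lets the scheduler see all tasks at once
results = dask.compute(*flat)

# Reshape the computed values back into a dataframe with the original layout
computed_df = pd.DataFrame(
    np.array(results).reshape(enr_df.shape),
    index=enr_df.index,
    columns=enr_df.columns,
)

Passing everything to one dask.compute call shares the scheduling work across all tasks, rather than computing each cell separately as applymap(compute) would.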

MRocklin
  • Right, will performance be best if all delayed objects are stored in a list, `dask.compute()` is called on this, and then reformatted? Even if the list is very large? – skurp Aug 12 '19 at 17:40
  • The performance will be no different. Dask has around a 300us overhead per task, regardless of how the task is presented. – MRocklin Aug 13 '19 at 13:03