I am looking for the best way to compute many dask `delayed` objects stored in a dataframe. I am unsure whether the pandas dataframe should be converted to a dask dataframe with the `delayed` objects inside it, or whether `compute` should be called on every value of the pandas dataframe.

I would appreciate any suggestions in general, as I am having trouble with the logic of passing `delayed` objects across nested for loops.
```python
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute

steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]

enr_df = pd.DataFrame()
for N in sample:
    # Build one column of delayed hypergeom.sf calls per sample size N
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
```
I cannot call `compute` on this dataframe without applying the function across all cells, like so: `enr_df.applymap(compute)` (which I believe calls `compute` on each value individually, so every cell gets its own task graph instead of sharing one).
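One workaround I have been sketching is to flatten the frame, pass every `delayed` object to a single `compute` call so that everything shares one task graph, and rebuild the DataFrame from the results. A minimal sketch, assuming `ravel` and `reshape` keep the cell order consistent (the `enr_computed` name is just for illustration):

```python
from dask import compute

# One compute call over all cells -> a single shared task graph
flat_results = compute(*enr_df.to_numpy().ravel())
enr_computed = pd.DataFrame(
    np.asarray(flat_results).reshape(enr_df.shape),
    index=enr_df.index,
    columns=enr_df.columns,
)
```

But this still feels like it is working against the dataframe rather than with it.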
However, if I convert to a dask dataframe, the `delayed` objects I want to compute are buried inside the dask dataframe structure:

```python
import dask.dataframe as dd

enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
```

Here the computation I expect does not happen: `enr_dd.compute()` only materializes the partitions and hands back the pandas frame with the `delayed` objects still unevaluated inside it.
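A variant I am also considering is to skip the intermediate pandas frame entirely and collect the `delayed` objects in a plain dict of lists, since (as far as I understand) `dask.compute` traverses built-in containers like dicts and lists but not pandas objects:

```python
import dask

results = {}
for N in sample:
    # Same nested loops as above, accumulated into a dict of lists
    results[N] = [
        delayed(hypergeom.sf)(k=np.random.randint(1, 200), M=10000, n=20, N=N, loc=0)
        for i in range(20)
    ]

# dask.compute recurses into the dict/lists, evaluating everything in one graph
(computed,) = dask.compute(results)
enr_df = pd.DataFrame(computed)
```

Is something like this the intended pattern, or is there a cleaner way?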