dask.DataFrame.apply and variable length data

Question

I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:

import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_varibale_length_series(x):
    '''returns pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_varibale_length_series, axis=1).compute()

Apparently, this works fine.

Concerning this, I have two questions:

Is this supposed to work always, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?

Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts the time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.

Edit: As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:

import dask

# convert ddf to a list of delayed objects, one per partition
ddf_delayed = ddf.to_delayed()
# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis=1))
# use this function on every delayed partition
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]
# calculate the result. This gives a tuple of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)
# concatenate them
result = pd.concat(result)

score 1 · Accepted Answer · answered Dec 14 '16 at 00:11

Short answer

No, dask.dataframe does not support this.

Long answer

Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
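To make the column mismatch concrete, here is a minimal pandas-only sketch. It emulates three partitions by slicing the frame by hand, which is only an approximation of how dask splits the data, and it swaps the random function from the question for a hypothetical deterministic one, events_for_row, so the result is reproducible. Each emulated partition ends up with a different number of columns, which is exactly what dask.dataframe does not expect.

```python
import numpy as np
import pandas as pd

def events_for_row(row):
    """Hypothetical stand-in for the question's function:
    returns as many values as the number stored in column A."""
    n_events = int(row["A"])
    return pd.Series(np.arange(n_events, dtype=float))

pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))

# Emulate three partitions of two rows each (assumption: dask's real
# partitioning may differ, this is only for illustration).
partitions = [pdf.iloc[i:i + 2] for i in range(0, len(pdf), 2)]

# Apply the function partition by partition, as dask would.
per_partition = [part.apply(events_for_row, axis=1) for part in partitions]

# Every partition produces a different set of columns.
column_counts = [len(df.columns) for df in per_partition]

# Plain pandas can still combine them: concat takes the union of the
# columns and pads the missing entries with NaN.
combined = pd.concat(per_partition)
```

Here column_counts differs from partition to partition, so no single column list fixed ahead of time can describe all of them. Concatenating in pandas still works because pd.concat aligns on the union of the columns.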

However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.

http://dask.pydata.org/en/latest/delayed.html
