On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

Question

In the code snippet below, I would expect the logs to print the numbers 0 - 4. I understand that the numbers may not appear in that order, since the task may be broken up into a number of parallel operations.

Code snippet:

from dask import dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5),
                   'B': np.arange(5),
                   'C': np.arange(5)})

ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    print('B val received: ' + str(x.B))
    return x

ddf.apply(aggregate, axis=1).compute()

But when the above code is run, I see this instead:

B val received: 1
B val received: 1
B val received: 1
B val received: 0
B val received: 0
B val received: 1
B val received: 2
B val received: 3
B val received: 4

Instead of 0 - 4, I see a series of 1 printed first, and an extra 0. I have noticed the "extra" rows of value 1 occurring every time I have set up a Dask DataFrame and run an apply operation on it.

Printing the dataframe shows no additional rows with value 1.

My question is: Where are these rows with value 1 coming from? Why do they appear to consistently occur prior to the "actual" rows in the dataframe? The 1 values seem unrelated to the values in the actual rows (that is, it is not as though it is for some reason grabbing the second row an extra few times).

score 8 · Answer 1 · answered Apr 14 '17 at 18:22

@Grr's answer is correct. Dask.dataframe doesn't know what your function will produce, but it still has to give you a lazy dask.dataframe with the correct column names, dtypes, etc., so it tries your function on a small amount of placeholder data. That is where the extra rows of 1s come from.

You can avoid these checks by providing metadata about your intended output using the meta= keyword (more details in the DataFrame.apply docstring). If you provide this information then Dask.dataframe will not need to try your function to determine types.

Copying this section here:

Docstring

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Solution

So if you create an example output dataframe with the right column names and dtypes, then you'll be fine:

meta = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]},
                    columns=['A', 'B', 'C'])
ddf.apply(aggregate, axis=1, meta=meta)

Or, in this case, because your function doesn't change the columns or dtypes of the input, you can just reuse the input's meta (stored in its _meta attribute):

ddf.apply(aggregate, axis=1, meta=ddf._meta)

score 5 · Accepted Answer · answered Apr 14 '17 at 18:05

Dask does some checking on what you have told it to do before it tries to do it on the entire collection of partitions. That is where the first few print statements are coming from. It's part of the built in error checking that prevents Dask from going down some long winded series of operations and failing at the end.

Grr


Thanks for the super fast answer. While this may help in error checking, it makes it quite difficult to actually execute my lambda function. For example, I am attempting to use a GeoDataFrame within Dask and want to run a buffer and bounds calculation on a certain geometry column, like so: `row.buffer.bounds`. This operation errors when fed something other than a Shapely object. I realize Dask does not explicitly support Geopandas, but this type constraint seems like an "intense" blocker. Any ideas on workarounds in this or similar situations? – kuanb Apr 14 '17 at 18:13

That one is beyond me. You could open an issue with an explicit and minimal example on their GitHub, or try to get in touch with @MRocklin. He seems to be the most active contributor on their GitHub, as well as here on high-level Dask questions. – Grr Apr 14 '17 at 18:17
