Why does Dask fill in "foo" and 1 in my Dataframe

Question

I've read in around 15 csv files:

import dask.dataframe as dd

df = dd.read_csv("gs://project/*.csv", blocksize=25e6,
                 storage_options={'token': fs.session.credentials})

Then I persisted the Dataframe (it uses 7.33 GB memory):

df = df.persist()

I set a new index because I want my group by on that field to be as efficient as possible:

df = df.set_index('column_a').persist()

Now I have 181 divisions and 180 partitions. To see how fast my group by would be, I tried a custom apply function that just prints each group's Dataframe:

grouped_by_index = df.groupby('column_a').apply(lambda n: print(n)).compute()

That printed a Dataframe with correct columns but the values are either "1", "foo" or "True". Example:

column_b  column_c column_d  column_e  column_f  column_g  \
index                                                                   
a          foo           1      foo        1           1           1

I also get the warning:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: meta is not specified, inferred from partial data. Please provide meta if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  """Entry point for launching an IPython kernel.

What is going on here?

score 5 · Accepted Answer · answered Feb 08 '19 at 16:04

Indeed, if you read the docs for apply, you will see that meta= is a parameter you can pass to tell Dask what the output of the operation will look like (its column names and dtypes). This is necessary because apply can do very general things, so Dask cannot know the shape of the result without running your function.

If you don't supply meta=, as in your case, then Dask will try to seed the operation with an example mini-dataframe containing 1 for any numerical columns, "foo" for text ones, and True for boolean ones, just to see what the output will look like. Since your apply function prints (and doesn't actually return anything), what you are seeing is this seed.

As suggested by the documentation, you are always better off providing meta= when possible, since Dask can then skip this inference step entirely.

answered Feb 08 '19 at 16:04

mdurant

When I provide meta data I get the error: `AttributeError: 'DataFrame' object has no attribute 'name'`. I define meta data like this: `meta = {'project': 'object', 'net': 'int64', 'no': 'object', 'v': 'int64', 'min': 'int64', 'avg': 'int64', 'associated': 'bool'}`. – Stanko Feb 08 '19 at 21:37
I thought that ought to work, but you'd be best off providing a zero-length dataframe. – mdurant Feb 08 '19 at 22:22
I defined meta like this: `meta=('visits', object)` and that did the trick, thanks for the help. – Stanko Feb 09 '19 at 12:31
Ah yes - it was the *output* that you are defining, which is just a single set of `None` values (because you don't return anything from the lambda). – mdurant Feb 09 '19 at 14:05
Even after specifying meta as "object" for an apply function that returns a dictionary, the values passed in are "foo" for empty cells. This doesn't seem to work correctly. Thanks – jsanjayce Apr 28 '23 at 14:52
