
I am using Dask to apply a function myfunc that adds two new columns new_col_1 and new_col_2 to my Dask dataframe ddata. This function uses the columns a1 and a2 to compute the new columns.

ddata[['new_col_1', 'new_col_2']] = ddata.map_partitions(
    lambda df: df.apply(lambda row: myfunc(row['a1'], row['a2']),
                        axis=1, result_type="expand")).compute()

This gives the following error:

ValueError: Metadata inference failed in `lambda`.

You have supplied a custom function and Dask is unable to determine the type of output that that function returns.

To resolve this please provide a meta= keyword.

How can I provide the meta keyword for this scenario?


2 Answers


meta can be provided as a keyword argument to .map_partitions:

some_result = dask_df.map_partitions(some_func, meta=expected_df)

expected_df can be specified manually, or you can compute it explicitly on a small sample of the data (in which case it will be a pandas DataFrame).
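
For the scenario in the question, here is a rough sketch of both options (illustrative only, reusing ddata and myfunc from the question; the helper names new_cols, sample, and expected_df are just for illustration, and with result_type="expand" the inner apply produces integer column labels 0 and 1, so the meta template mirrors that):

import pandas as pd

# Option 1: describe the expected output by hand -- an empty pandas
# DataFrame with the right column labels and dtypes is enough.
expected_df = pd.DataFrame({0: pd.Series(dtype='object'),
                            1: pd.Series(dtype='object')})

new_cols = ddata.map_partitions(
    lambda df: df.apply(lambda row: myfunc(row['a1'], row['a2']),
                        axis=1, result_type="expand"),
    meta=expected_df)

# Option 2: run the same apply on a small pandas sample and reuse the
# concrete result as the template.
sample = ddata.head(2)  # a small pandas DataFrame
expected_df = sample.apply(lambda row: myfunc(row['a1'], row['a2']),
                           axis=1, result_type="expand")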

There are more details in the docs.

– SultanOrazbayev
  • The function `myfunc` returns two values, which are added as columns `new_col_1` and `new_col_2`; these can be string or JSON. Since the function does not return a `DataFrame`, how do I specify `meta`? – S_S Jan 24 '22 at 05:58
  • 1
    Ah, true, meta is more flexible that just dataframes. I guess in your case it would have t o be a tuple of series (that are each a string or json). – SultanOrazbayev Jan 24 '22 at 06:29

Sultan's answer is perfect about using meta. :)

You can also avoid using map_partitions here because Dask implements apply, which calls map_partitions internally:

import json
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': range(1,5),
                   'y': range(6,10),
                  }).astype('str')

ddf = dd.from_pandas(df, npartitions=2)

def myfunc(x):
    # Build one string value and one JSON string from the two row values
    s = "string: " + x[0]
    j = json.dumps({'json': x[1]})
    return [s, j]

# meta describes the two object-dtype columns produced by result_type="expand"
ddf[['new_col_1', 'new_col_2']] = ddf.apply(
    myfunc, axis=1, result_type="expand", meta={0: 'object', 1: 'object'})

ddf.compute()

# Output of ddf.compute():
#
#    x  y  new_col_1      new_col_2
# 0  1  6  string: 1  {"json": "6"}
# 1  2  7  string: 2  {"json": "7"}
# 2  3  8  string: 3  {"json": "8"}
# 3  4  9  string: 4  {"json": "9"}

Also, in your code snippet, calling .compute() returns a pandas DataFrame, so assigning it back to a Dask DataFrame (ddata) will raise an error. I'd suggest calling compute on ddata after the assignment instead.
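
As a minimal sketch of that ordering (reusing the snippet from the question, and assuming a meta of two object-dtype columns):

# Assign lazily -- no .compute() here -- supplying meta for the expanded output
ddata[['new_col_1', 'new_col_2']] = ddata.map_partitions(
    lambda df: df.apply(lambda row: myfunc(row['a1'], row['a2']),
                        axis=1, result_type="expand"),
    meta={0: 'object', 1: 'object'})

# Materialize only at the end
result = ddata.compute()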

– pavithraes

  • Thank you for the suggestion. I can use `apply`. – S_S Jan 25 '22 at 10:14
  • When I use `ddf.apply` directly without using `map_partitions` and then call `compute`, is there a way to skip those rows that may generate any exceptions? – S_S Jan 27 '22 at 08:01
  • I don't think there's anything dask-specific for this, but maybe you can do a check in the function (`myfunc`) to make sure it'll work correctly? Also, it's probably worth asking a new question? – pavithraes Jan 30 '22 at 13:10