
I encountered a strange error when assigning a new column to an existing dask dataframe. Given the minimal example below,

import pandas as pd
from dask import dataframe as dd
from dask import array as da

# A 10-row dask dataframe split into partitions of 2 rows each
foo = dd.from_pandas(pd.DataFrame({'number': list(range(10))}), chunksize=2)
# One placeholder value per row; the round trip through dask array
# leaves add_me as a plain numpy array of strings
add_me = ["N/A" for _ in range(len(foo.index))]
add_me = da.from_array(add_me, chunks='auto').compute()

I'd expect the following to work just fine:

foo = foo.assign(added=lambda x: add_me[x['number']]).compute()

However, this throws the following error:

ValueError: Length of values does not match length of index

When I compute the dataframe first and then add the column using the same syntax (now in native pandas), it works just fine:

foo = foo.compute()
foo = foo.assign(added=lambda x: add_me[x['number']])
foo
   number added
0       0   N/A
1       1   N/A
2       2   N/A
3       3   N/A
4       4   N/A
5       5   N/A
6       6   N/A
7       7   N/A
8       8   N/A
9       9   N/A

Am I missing something here?

I read the following related post (Dask error: Length of values does not match length of index), but didn't find helpful advice.


1 Answer


Assuming the operation you are actually performing is row-wise (it does not depend on the values of other rows), the dask version of the code can be refactored to:

foo = foo.map_partitions(lambda df: df.assign(added=lambda x: add_me[x['number']]))
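This works because map_partitions calls the function on each concrete pandas partition, so assign receives exactly one value per row of that partition. And since 'number' holds each row's global position, indexing add_me by it stays aligned even though the function only ever sees a 2-row slice at a time.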

Or in a slightly more readable version:

def add_custom(df):
    # df is an ordinary pandas DataFrame here, one partition at a time
    return df.assign(added=lambda x: add_me[x['number']])

foo = foo.map_partitions(add_custom)
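
For completeness, here is a minimal end-to-end sketch of this fix using the same toy data as the question (add_me is built directly as a numpy array here just for brevity):

import numpy as np
import pandas as pd
from dask import dataframe as dd

# Same toy setup as in the question
foo = dd.from_pandas(pd.DataFrame({'number': list(range(10))}), chunksize=2)
add_me = np.array(["N/A" for _ in range(10)])  # one value per global row

def add_custom(df):
    # Runs once per partition; 'number' holds global row positions,
    # so fancy-indexing add_me yields one value per row of this partition
    return df.assign(added=lambda x: add_me[x['number']])

result = foo.map_partitions(add_custom).compute()
print(result)  # 10 rows, each with added == "N/A"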