I encountered a very strange error when assigning a new column to an existing dask dataframe. Given the minimal example below:
import pandas as pd
from dask import dataframe as dd
from dask import array as da
foo = dd.from_pandas(pd.DataFrame({'number':list(range(10))}), chunksize=2)
add_me = ["N/A" for _ in range(len(foo.index))]
add_me = da.from_array(add_me, chunks='auto').compute()
I'd expect the following to work just fine:
foo = foo.assign(added=lambda x: add_me[x['number']]).compute()
However, this throws the following error:
ValueError: Length of values does not match length of index
When I compute the dataframe first and then add the column using the same syntax (this time in native pandas), it works just fine:
foo = foo.compute()
foo = foo.assign(added=lambda x: add_me[x['number']])
foo
   number added
0       0   N/A
1       1   N/A
2       2   N/A
3       3   N/A
4       4   N/A
5       5   N/A
6       6   N/A
7       7   N/A
8       8   N/A
9       9   N/A
Am I missing something here?
I read the related post "Dask error: Length of values does not match length of index", but didn't find any helpful advice there.