
In dask I am getting the error: "ValueError: The columns in the computed data do not match the columns in the provided metadata Order of columns does not match"

This does not make sense to me, as the metadata I provide is correct. It is passed as a dict, so it should not be order-sensitive anyway.

A minimal working example is below:

from datetime import date
import pandas as pd
import numpy as np
from dask import delayed
import dask.dataframe as dsk

# Making example data
# Making example data
values = pd.DataFrame({
    'date': [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 2)],
    'id': [1, 2, 1, 2],
    'A': [4, 5, 2, 2],
    'B': [7, 3, 6, 1],
})

def get_dates():
    return pd.DataFrame({'date': [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 2)]})

def append_values(df):
    df2 = pd.merge(df, values, on='date', how='left')
    return df2

t0 = get_dates()
t1 = delayed(t0)
t2 = dsk.from_delayed(t1)
t = t2.map_partitions(append_values, meta={'A': 'f8', 'B': 'i8', 'id': 'i8', 'date': 'object'}, enforce_metadata=False)

# Applying a grouped function.
def func(x,y):
    return pd.DataFrame({'summ' : [np.mean(x) + np.mean(y)], 'difference' : [int(np.floor(np.mean(x) - np.mean(y)))]})

# Everything works when I compute the dataframe before doing the apply,
# but I want to distribute the apply, so that option doesn't work for me.
res = t.compute().groupby(['date']).apply(lambda df: func(df['A'], df['B']))
# This fails as the meta is out of order. But the meta is a dict and is hence not supposed to be ordered anyway!
res = t.groupby(['date']).apply(lambda df: func(df['A'], df['B'])).compute()

What did I do wrong here and how do I fix it? While one workaround is to compute before doing the grouping operation, this is not feasible for my actual case (where there is too much data to hold it in RAM).

One other question that may be related, though I don't think it is: *ValueError: The columns in the computed data do not match the columns in the provided metadata*. That one seems to be specific to CSV parsing with dask.


1 Answer


The order of keys in the dict supplied to `meta` does seem to matter. Changing the order as below yields only a warning:

    # changing the order of keys in this dict
    # meta={"date": "object", "id": "i8", "B": "i8", "A": "f8", },
    meta={"date": "object", "id": "i8", "A": "f8", "B": "i8"},
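Since Python 3.7, a plain `dict` preserves insertion order, so the two meta dicts above really do describe differently ordered frames. A minimal pandas-only sketch, mirroring how an empty meta frame would be built from each dict (`m1`/`m2` are illustrative names):

```python
import pandas as pd

# dicts preserve insertion order (Python 3.7+), so the two meta dicts
# from the question produce differently ordered empty frames:
m1 = pd.DataFrame({'A': pd.Series(dtype='f8'),
                   'B': pd.Series(dtype='i8'),
                   'id': pd.Series(dtype='i8'),
                   'date': pd.Series(dtype='object')})
m2 = pd.DataFrame({'date': pd.Series(dtype='object'),
                   'id': pd.Series(dtype='i8'),
                   'A': pd.Series(dtype='f8'),
                   'B': pd.Series(dtype='i8')})

print(list(m1.columns))  # ['A', 'B', 'id', 'date'] -- mismatches the merge output
print(list(m2.columns))  # ['date', 'id', 'A', 'B'] -- matches pd.merge(df, values)
```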

My guess is that dask internally uses the order of the keys to construct the meta dataframe, though I'm not certain. The point is that after `t.compute()` the result is a pandas dataframe, so the subsequent groupby selects columns by name rather than by position, while before `.compute()` the dataframe is still a lazy dask dataframe, and dask looks for columns in the order given in `meta` (and then sees a mismatch).
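An order-robust alternative, not given in the answer above but supported by dask (`meta` may also be an empty pandas DataFrame): build the meta by applying the mapped function to an empty slice of the input, so the column order matches the computed output by construction. A sketch reusing the question's `append_values`:

```python
from datetime import date
import pandas as pd

values = pd.DataFrame({'date': [date(2020, 1, 1)], 'id': [1], 'A': [4.0], 'B': [7]})

def append_values(df):
    return pd.merge(df, values, on='date', how='left')

t0 = pd.DataFrame({'date': [date(2020, 1, 1)]})

# A zero-row slice keeps the schema; running the real function on it yields
# an empty frame whose columns are ordered exactly like the computed data.
meta = append_values(t0.iloc[:0])
print(list(meta.columns))  # ['date', 'id', 'A', 'B']

# then, with the dask dataframe from the question:
# t = t2.map_partitions(append_values, meta=meta)
```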

  • Seems weird though as `t.compute()` works. It just throws this ordering error if you do not compute the dataframe before the apply. – Stuart May 09 '22 at 14:01
  • The thing is that after `t.compute()` the df is `pandas`, so the subsequent groupby knows what to pick, while before compute, my guess is `dask` is trying to look for a column with the order given in meta (and then sees a mismatch)... this is probably something that can be fixed as a PR... – SultanOrazbayev May 09 '22 at 15:15
  • I'm not sure about this but I'd guess that some part of the code is working with the dictionary as if it's a dataframe, but that depending on the dictionary to be sorted in column order isn't an intended behavior. This could be worth [filing an issue with dask](https://docs.dask.org/en/latest/develop.html#issues) on https://github.com/dask/dask/issues if you're up for it @Stuart. – Michael Delgado May 09 '22 at 19:24