
I wish to combine many dataframes into one dataframe with dask. However, when I try to build the combined dataframe with `dd.from_delayed(parts, meta=types)`, I get the error `Metadata mismatch found in 'from_delayed'`.

The full error:

Metadata mismatch found in `from_delayed`.

Partition type: `pandas.core.frame.DataFrame`
+--------+-------+----------+
| Column | Found | Expected |
+--------+-------+----------+
| 'col3' | -     | object   |
+--------+-------+----------+

I know this is because the dataframes I wish to combine do not have the same columns. Data that does not exist in a column should be marked as NA. Setting `verify_meta=False` silences the error, but leads to issues downstream because some of the partitions don't match the metadata.
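For context, plain pandas already does the desired alignment: `pd.concat` takes the union of the columns and fills the gaps with NaN. A minimal sketch (hypothetical data) of the target behaviour:

```python
import pandas as pd

df_a = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df_b = pd.DataFrame({"col1": [5, 6], "col3": [7, 8]})

# concat aligns on the union of columns and fills missing values with NaN
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.columns.tolist())       # ['col1', 'col2', 'col3']
print(int(combined["col3"].isna().sum()))  # 2
```

The question is how to get the same union-and-fill behaviour when the partitions are delayed objects checked against a single `meta`.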

The code:

import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask import delayed

def dict_to_dataframe(data):
    # `data` rather than `dict`, to avoid shadowing the builtin
    return pd.DataFrame.from_dict(data)


data_a = {'col1': [[1, 2, 3, 4], [5, 6, 7, 8]], 'col2': [[9, 10, 11, 12], [13, 14, 15, 16]]}
data_b = {'col1': [[17, 18, 19, 20], [21, 22, 23, 24]], 'col3': [[25, 26, 27, 28], [29, 30, 31, 32]]}

parts = [delayed(dict_to_dataframe)(fn) for fn in [data_a, data_b]]
types = pd.DataFrame(columns=['col1', 'col2', 'col3'], dtype=object)
ddf_result = dd.from_delayed(parts, meta=types)

print()
print('Write to file')
file_path = 'test.hdf'
with ProgressBar():
    # `key` must be a string; `key=type` passed the builtin `type` by mistake
    ddf_result.compute().sort_index().to_hdf(file_path, key='data', format='table')

written = dd.read_hdf(file_path, key='data')
Sam
    Please ask one question at a time. You need to align the dataframes first - they can’t have different columns. And you can set a subset of columns as string type with e.g. `df[[col1, col3]] = df[[col1, col3]].astype("string")` – Michael Delgado Feb 28 '23 at 11:42

2 Answers


To build a single dataframe from delayed objects, it's possible to enforce a specific schema during formation of the delayed objects. Building on a related answer, `dict_to_dataframe` can be modified to incorporate this logic, for example:

def dict_to_dataframe(data, common_columns):
    df = pd.DataFrame.from_dict(data)
    # keep only the columns known to be common to all dataframes
    df = df[common_columns]
    return df

With a matching change to the `meta` example, the rest of the code works as before. Here's the full snippet:

import dask.dataframe as dd
import pandas as pd
from dask import delayed
from dask.diagnostics import ProgressBar


def dict_to_dataframe(data, common_columns):
    df = pd.DataFrame.from_dict(data)
    # keep only the columns known to be common to all dataframes
    df = df[common_columns]
    return df


data_a = {
    "col1": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "col2": [[9, 10, 11, 12], [13, 14, 15, 16]],
}
data_b = {
    "col1": [[17, 18, 19, 20], [21, 22, 23, 24]],
    "col3": [[25, 26, 27, 28], [29, 30, 31, 32]],
}

common_columns = ["col1"]

parts = [delayed(dict_to_dataframe)(fn, common_columns) for fn in [data_a, data_b]]
types = pd.DataFrame(columns=common_columns, dtype=object)
ddf_result = dd.from_delayed(parts, meta=types)

print("Write to file")
file_path = "test.csv"
with ProgressBar():
    ddf_result.compute().sort_index().to_csv(file_path, index=False)

written = dd.read_csv(file_path)
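A quick pandas-only sanity check (hypothetical data, not part of the answer) shows why this satisfies the meta check: after trimming to `common_columns`, every partition has exactly the columns that `meta` declares:

```python
import pandas as pd

def dict_to_dataframe(data, common_columns):
    # keep only the columns shared by every input dict
    return pd.DataFrame.from_dict(data)[common_columns]

meta = pd.DataFrame(columns=["col1"], dtype=object)
parts = [dict_to_dataframe(d, ["col1"])
         for d in ({"col1": [[1, 2]], "col2": [[3, 4]]},
                   {"col1": [[5, 6]], "col3": [[7, 8]]})]

# every partition now matches the meta's columns exactly
print(all(list(p.columns) == list(meta.columns) for p in parts))  # True
```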
SultanOrazbayev
  • Thanks for your answer! In fact I want to combine all the data rather than keep only the common columns. Columns that do not exist in the other dataframe should receive NA values. – Sam Feb 28 '23 at 12:11
  • That's a good exercise for you to check if you understand this answer well! – SultanOrazbayev Feb 28 '23 at 12:13
  • I understand that it is not possible to combine the dataframes when the columns do not match. Apart from that, your answer doesn't do what I wish to perform. – Sam Feb 28 '23 at 12:16

The dataframes you want to combine need to have the same columns. A solution is to add the missing columns to each dataframe so that they can be combined.

import dask.dataframe as dd
import pandas as pd
from dask import delayed
from dask.diagnostics import ProgressBar

def dict_to_dataframe(data, all_columns):
    df = pd.DataFrame.from_dict(data)

    # Add missing columns and sort columns
    missing_columns = list(set(all_columns).difference(df.columns))
    df = df.reindex(columns=sorted([*df.columns.tolist(), *missing_columns]))

    # Set the new columns' type to object
    df[missing_columns] = df[missing_columns].astype(object)

    return df


data_a = {
    101: [[1, 2, 3, 4], [5, 6, 7, 8]],
    110: [[9, 10, 11, 12], [13, 14, 15, 16]],
}
data_b = {
    105: [[17, 18, 19, 20], [21, 22, 23, 24]],
    130: [[25, 26, 27, 28], [29, 30, 31, 32]],
}


all_columns = [101, 105, 110, 130, 140]
parts = [delayed(dict_to_dataframe)(fn, all_columns) for fn in [data_a, data_b]]
types = pd.DataFrame(columns=all_columns, dtype=object)
ddf_result = dd.from_delayed(parts, meta=types)

file_path = 'test.csv'
with ProgressBar():
    ddf_result.compute().sort_index().to_csv(file_path, index=False)
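As a side note (a sketch, not part of the answer as posted): `reindex` against the full column list already adds any missing columns as NaN and fixes the order in one call, so the manual set-difference step can be folded in:

```python
import pandas as pd

def dict_to_dataframe(data, all_columns):
    # reindex adds any missing columns (filled with NaN) and sorts the order
    return pd.DataFrame.from_dict(data).reindex(columns=sorted(all_columns))

df = dict_to_dataframe({101: [[1, 2]], 110: [[3, 4]]}, [101, 105, 110, 130, 140])
print(df.columns.tolist())  # [101, 105, 110, 130, 140]
```

Note that columns added this way hold NaN and get a float dtype, so the `astype(object)` step above would still be needed for the partitions to match an all-object `meta`.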
Sam