I want to combine many dataframes into one dataframe with dask. However, when I try to read those dataframes with dd.from_delayed(parts, meta=types), I get the error "Metadata mismatch found in `from_delayed`".
The full error:
Metadata mismatch found in `from_delayed`.
Partition type: `pandas.core.frame.DataFrame`
+--------+-------+----------+
| Column | Found | Expected |
+--------+-------+----------+
| 'col3' | - | object |
+--------+-------+----------+
I know this is because the dataframes I want to combine do not all have the same columns. Where a column does not exist in a given dataframe, its values should be marked as NA. Setting verify_meta=False will silence these errors, but will lead to issues downstream, since some of the partitions don't match the metadata.
The code:
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask import delayed
import os
def dict_to_dataframe(data):
    # Build one pandas partition from a plain dict of columns.
    return pd.DataFrame.from_dict(data)
data_a = {'col1': [[1, 2, 3, 4], [5, 6, 7, 8]], 'col2': [[9, 10, 11, 12], [13, 14, 15, 16]]}
data_b = {'col1': [[17, 18, 19, 20], [21, 22, 23, 24]], 'col3': [[25, 26, 27, 28], [29, 30, 31, 32]]}
parts = [delayed(dict_to_dataframe)(fn) for fn in [data_a, data_b]]
types = pd.DataFrame(columns=['col1', 'col2', 'col3'], dtype=object)
ddf_result = dd.from_delayed(parts, meta=types)
print()
print('Write to file')
file_path = os.path.join('test.hdf')
with ProgressBar():
    # HDF keys are strings; '/data' is a placeholder key chosen for this example.
    ddf_result.compute().sort_index().to_hdf(file_path, key='/data', format='table')
written = dd.read_hdf(file_path, key='/data')
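To make the desired result concrete, here is a minimal pandas-only sketch of what I mean by "marked as NA": each partition is reindexed to the full set of expected columns before it would be handed to dask. The names EXPECTED_COLUMNS and dict_to_aligned_dataframe are hypothetical, and the cast to object is an assumption made so that the new all-NA columns match the object dtype declared in the meta. I have not confirmed that this is the recommended way to do it with dask.

```python
import pandas as pd

# Hypothetical name: the full set of columns every partition should expose.
EXPECTED_COLUMNS = ['col1', 'col2', 'col3']


def dict_to_aligned_dataframe(data, columns=EXPECTED_COLUMNS):
    """Build a dataframe and align it to the expected columns.

    Columns missing from `data` are added and filled with NaN.
    The result is cast to object (an assumption) so that every
    column has the same dtype as the all-object meta dataframe.
    """
    df = pd.DataFrame.from_dict(data)
    return df.reindex(columns=columns).astype(object)


data_a = {'col1': [[1, 2, 3, 4], [5, 6, 7, 8]],
          'col2': [[9, 10, 11, 12], [13, 14, 15, 16]]}
data_b = {'col1': [[17, 18, 19, 20], [21, 22, 23, 24]],
          'col3': [[25, 26, 27, 28], [29, 30, 31, 32]]}

aligned_a = dict_to_aligned_dataframe(data_a)  # gains an all-NaN 'col3'
aligned_b = dict_to_aligned_dataframe(data_b)  # gains an all-NaN 'col2'

print(aligned_a)
print(aligned_b)
```

If this is a reasonable direction, the aligned function would presumably take the place of dict_to_dataframe in the delayed(...) call above, so that every partition already has the columns listed in the meta.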