Based on the answer I received to an earlier question, I have written an ETL procedure that looks as follows:
import dask
import pandas as pd
from dask import delayed
from dask import dataframe as dd
def preprocess_files(filename):
    """Reads file, collects metadata and identifies lines not containing data."""
    ...
    return filename, metadata, skiprows


def load_file(filename, skiprows):
    """Loads the file into a pandas dataframe, skipping lines not containing data."""
    ...
    return df


def process_errors(filename, skiplines):
    """Calculates error metrics based on the information
    collected in the pre-processing step.
    """
    ...


def process_metadata(filename, metadata):
    """Analyses metadata collected in the pre-processing step."""
    ...
values = [delayed(preprocess_files)(fn) for fn in file_names]
filenames = [value[0] for value in values]
metadata = [value[1] for value in values]
skiprows = [value[2] for value in values]
error_results = [delayed(process_errors)(arg[0], arg[1])
                 for arg in zip(filenames, skiprows)]
meta_results = [delayed(process_metadata)(arg[0], arg[1])
                for arg in zip(filenames, metadata)]
dfs = [delayed(load_file)(arg[0], arg[1])
       for arg in zip(filenames, skiprows)]
... # several delayed transformations defined on individual dataframes
# finally: categorize several dataframe columns and write them to HDF5
dfs = dd.from_delayed(dfs, meta=metaframe)
dfs = dfs.categorize(columns=[...])  # I would like to delay this
dfs.to_hdf(hdf_file_name, '/data', ...)  # I would also like to delay this
all_operations = error_results + meta_results # + delayed operations on dask dataframe
# trigger all computation at once,
# allowing the data collected in the pre-processing step to be re-used
dask.compute(*all_operations)
The ETL process goes through several steps:
- Pre-process the files, identify lines which do not include any relevant data and parse metadata
- Using the information gathered, process the error information and the metadata, and load the data lines into pandas dataframes in parallel (re-using the results from the pre-processing step). The operations (process_metadata, process_errors, load_file) have a shared data dependency in that they all use information gathered in the pre-processing step. Ideally, the pre-processing step would only be run once and the results shared across processes (see the quick check right after this list).
- Eventually, collect the pandas dataframes into a dask dataframe, categorize them and write them to HDF.
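A quick check of that assumption: as far as I understand, unpacking the delayed result by indexing (as in the listing above) only adds lightweight getitem tasks on top of a single preprocess_files task per file, so the sharing itself should work as long as everything ends up in one compute call. The file name below is just a hypothetical example:

one = delayed(preprocess_files)('example_file.txt')  # hypothetical file name
fn, meta, rows = one[0], one[1], one[2]  # three getitem tasks on the same underlying task
# Computing them together should run preprocess_files exactly once,
# because fn, meta and rows all depend on the same key.
dask.compute(fn, meta, rows)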
The problem I am having with this is that categorize and to_hdf trigger computation immediately, discarding the metadata and error data which would otherwise be further processed by process_errors and process_metadata.
I have been told that delaying operations on dask dataframes can cause problems, which is why I would be very interested to know whether it is possible to trigger the entire computation (processing metadata, processing errors, loading dataframes, transforming dataframes and storing them in HDF format) at once, allowing the different processes to share the data collected in the pre-processing phase.
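To make this more concrete: if I read the dask documentation correctly, to_hdf accepts compute=False and then returns a delayed task instead of writing immediately. So, replacing the last few lines of the listing above, something along these lines is roughly what I am hoping to end up with (a sketch only, with the same placeholder column list as above; I do not know whether categorize can be made lazy in a similar way):

ddf = dd.from_delayed(dfs, meta=metaframe)
ddf = ddf.categorize(columns=[...])  # placeholder columns; as far as I can tell this still computes eagerly
# compute=False should give back a delayed write task (or tasks) instead of writing right away
write_task = ddf.to_hdf(hdf_file_name, '/data', compute=False)
# a single compute call for everything, so the pre-processing results are shared
dask.compute(*(error_results + meta_results + [write_task]))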