
I would like to return an empty dataframe (or None) from a set of delayed tasks where parsing fails, e.g.:

import os

import dask.dataframe as dd
import dask.delayed
import pandas as pd

def _read(filename):
    try:
        df = pd.read_csv(filename, sep=';', decimal=',', encoding='latin1', index_col=False)
        return df
    except Exception:
        # Parsing failed; return an empty frame so the task still yields a dataframe
        return pd.DataFrame()


tasks = []
for root, dirs, files in os.walk(os.path.join(self._path, "files")):
    for file in files:
        tasks.append(dask.delayed(_read, pure=True)(os.path.join(root, file)))

ddf = dd.from_delayed(tasks)

One or two of the files fail to parse, and at the moment I receive a metadata mismatch. I could return a dataframe with the dask dataframe metadata specified, but I'm wondering if there's a better way.
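For reference, the variant I had in mind, sketched under the assumption that the expected columns and dtypes are known up front (the column names here are made up):

import dask.dataframe as dd
import pandas as pd

# Hypothetical schema; substitute the real columns/dtypes of the CSVs
meta = pd.DataFrame({'col_a': pd.Series(dtype='float64'),
                     'col_b': pd.Series(dtype='object')})

def _read(filename):
    try:
        return pd.read_csv(filename, sep=';', decimal=',',
                           encoding='latin1', index_col=False)
    except Exception:
        return meta.copy()  # empty frame with the expected schema

# meta= also tells dask what schema to expect from each partition
ddf = dd.from_delayed(tasks, meta=meta)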

morganics
  • Can you make a dataframe with the right columns/dtypes but zero rows, perhaps from the first file that parses successfully? – mdurant Nov 28 '17 at 16:31

2 Answers


Going with the comment from @mdurant: it's not as easy as you'd expect to copy a dataframe while keeping its dtypes, but the following seems to work. It won't help if your first file errors out, of course.
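To see why the naive copy falls short: pd.DataFrame(columns=df.columns) keeps the column names but drops the dtypes (every column comes back as object), which is exactly what triggers dask's metadata mismatch. A quick illustration with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1.0], 'b': ['x']})
naive = pd.DataFrame(columns=df.columns)
print(naive.dtypes)  # both columns are object; float64 is lost

The code below therefore builds the empty frame column by column, carrying each dtype over: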

import os

import dask.dataframe as dd
import dask.delayed
import pandas as pd

_default_record = None

def _read(filename):
    global _default_record
    try:
        df = pd.read_csv(filename, sep=';', decimal=',', encoding='latin1', index_col=False)
        if _default_record is None:
            # Zero-row frame with the same columns and dtypes as the
            # first successfully parsed file
            _default_record = pd.DataFrame.from_items([
                (name, pd.Series(data=None, dtype=series.dtype))
                for name, series in df.head(1).iteritems()])
        return df
    except Exception:
        return _default_record


tasks = []
for root, dirs, files in os.walk(os.path.join(self._path, "files")):
    for file in files:
        tasks.append(dask.delayed(_read, pure=True)(os.path.join(root, file)))

ddf = dd.from_delayed(tasks)
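As an aside (not part of the original answer): a simpler way to get a zero-row copy that keeps both columns and dtypes is to slice the first successfully parsed frame:

if _default_record is None:
    _default_record = df.iloc[:0]  # empty, but columns and dtypes preserved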
morganics
  • when I try this code, I get this error: UnboundLocalError: local variable '_default_record' referenced before assignment. Any ideas? – EMiller Jan 21 '21 at 00:26

Answer from @morganics, updated for what I'm assuming are newer versions of pandas (1.1.5) and dask (2020.12.0).

import os

import dask.dataframe as dd
import dask.delayed
import pandas as pd

_default_record = None

def _read(filename):
    global _default_record
    try:
        df = pd.read_csv(filename, sep=';', decimal=',', encoding='latin1', index_col=False)
        if _default_record is None:
            # A dict of empty, correctly typed Series yields a zero-row
            # frame with the same schema as the first parsed file
            _default_record = pd.DataFrame({
                name: pd.Series(data=None, dtype=series.dtype)
                for name, series in df.items()})
        return df
    except Exception:
        return _default_record


tasks = []
for root, dirs, files in os.walk(os.path.join(self._path, "files")):
    for file in files:
        tasks.append(dask.delayed(_read, pure=True)(os.path.join(root, file)))

ddf = dd.from_delayed(tasks)

I made _default_record a global variable, and I replaced the from_items constructor with a plain dict of typed Series, because from_items no longer exists in my version of pandas.
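A quick sanity check (with made-up data) that the rewritten constructor still preserves the schema:

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': ['x']})
empty = pd.DataFrame({name: pd.Series(data=None, dtype=series.dtype)
                      for name, series in df.items()})
print(len(empty))    # 0 rows
print(empty.dtypes)  # a: int64, b: object; same dtypes as df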

EMiller