I would like to return an empty dataframe (or None) from any delayed task whose parsing fails, e.g.:
import os

import dask
import dask.dataframe as dd
import pandas as pd

def _read(filename):
    try:
        return pd.read_csv(filename, sep=';', decimal=',', encoding='latin1', index_col=False)
    except Exception:
        # Parsing failed: fall back to an empty frame with no columns
        return pd.DataFrame()

tasks = []
# self._path is the base directory stored on the enclosing class
for root, dirs, files in os.walk(os.path.join(self._path, "files")):
    for file in files:
        tasks.append(dask.delayed(_read, pure=True)(os.path.join(root, file)))
ddf = dd.from_delayed(tasks)
One or two of the files fail to parse, and at the moment I get a metadata mismatch, because the empty dataframe has none of the columns the other partitions have. I could return an empty dataframe that matches the dask dataframe's metadata, but I'm wondering whether there's a better way.
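For reference, the fallback mentioned above (an empty dataframe that still carries the expected columns and dtypes) could look roughly like the sketch below. The names `META` and `read_or_empty`, and the columns `id` and `value`, are assumptions for illustration only; the real schema would come from the files that do parse.

```python
import pandas as pd

# Hypothetical schema: these column names and dtypes are placeholders
# for whatever the successfully parsed files actually contain.
META = pd.DataFrame({
    "id": pd.Series(dtype="int64"),
    "value": pd.Series(dtype="float64"),
})


def read_or_empty(filename, meta=META):
    """Parse one file; on failure return a zero-row frame with the expected schema."""
    try:
        return pd.read_csv(filename, sep=";", decimal=",",
                           encoding="latin1", index_col=False)
    except Exception:
        # Same columns and dtypes as the schema template, but no rows.
        return meta.iloc[:0].copy()
```

The same `META` frame could then be passed as `meta=` to `dd.from_delayed`, so the failed files contribute empty partitions with matching columns rather than frames with no columns at all.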