
I am trying to read a .txt with dask (7 million rows approximately). However, there are like 4000 rows that mismatch the dtype of the column:

+-----------------------------+--------+----------+
| Column                      | Found  | Expected |
+-----------------------------+--------+----------+
| Pro_3FechaAprobacion        | object | int64    |
| Pro_3FechaCancelContractual | object | int64    |
| Pro_3FechaDesembolso        | object | int64    |
+-----------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- Pro_3FechaAprobacion
  ValueError("invalid literal for int() with base 10: '200904XX'")
- Pro_3FechaCancelContractual
  ValueError("invalid literal for int() with base 10: '        '")
- Pro_3FechaDesembolso
  ValueError("invalid literal for int() with base 10: '200904XX'")

I know these are date columns formatted like %Y%m%d, but some records look like %Y%mXX. I want to skip those rows, the way I can in pandas with:

df = pd.read_csv("file.txt", error_bad_lines=False)

Is there any way to do this in dask?

rpanai
davidaap

1 Answer


The error_bad_lines=False keyword comes from pandas.read_csv, and I don't think it supports the behavior you want (it skips malformed lines with the wrong number of fields, not lines with bad values). You might consider asking this same question with the pandas tag to see if people familiar with pandas' read_csv can offer suggestions, since dask.dataframe.read_csv just uses that code under the hood.

MRocklin