
I am trying to read a .txt with dask (7 million rows approximately). However, there are like 4000 rows that mismatch the dtype of the column:

+-----------------------------+--------+----------+
| Column                      | Found  | Expected |
+-----------------------------+--------+----------+
| Pro_3FechaAprobacion        | object | int64    |
| Pro_3FechaCancelContractual | object | int64    |
| Pro_3FechaDesembolso        | object | int64    |
+-----------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- Pro_3FechaAprobacion
  ValueError("invalid literal for int() with base 10: '200904XX'")
- Pro_3FechaCancelContractual
  ValueError("invalid literal for int() with base 10: '        '")
- Pro_3FechaDesembolso
  ValueError("invalid literal for int() with base 10: '200904XX'")

I know these are date columns formatted like %Y%m%d, but some records look like %Y%mXX. I want to skip those rows, the way I can in pandas with:

df = pd.read_csv("file.txt", error_bad_lines=False)

Is there any way to do this in dask?

rpanai
davidaap

1 Answer


The error_bad_lines=False keyword comes from pandas.read_csv, and I don't think it supports the behavior you want (it skips malformed lines with the wrong number of fields, not lines with bad values). You might consider asking this same question with the pandas tag to see if people familiar with pandas' read_csv can offer suggestions, since dask.dataframe.read_csv just uses that code under the hood.

MRocklin