I am trying to read a .txt file with dask (approximately 7 million rows). However, about 4,000 rows do not match the dtype dask expects for their column:
+-----------------------------+--------+----------+
| Column                      | Found  | Expected |
+-----------------------------+--------+----------+
| Pro_3FechaAprobacion        | object | int64    |
| Pro_3FechaCancelContractual | object | int64    |
| Pro_3FechaDesembolso        | object | int64    |
+-----------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- Pro_3FechaAprobacion
ValueError("invalid literal for int() with base 10: '200904XX'")
- Pro_3FechaCancelContractual
ValueError("invalid literal for int() with base 10: ' '")
- Pro_3FechaDesembolso
ValueError("invalid literal for int() with base 10: '200904XX'")
I know these are date columns formatted as %Y%m%d, but some records look like %Y%mXX and some are blank. I want to skip those rows, as I would when using:
df = pd.read_csv("file.txt",error_bad_lines=False)
Is there any way to do this in dask?
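To make the goal concrete, here is a minimal sketch in plain pandas of the behaviour I am after. It is only an illustration under my own assumptions: the in-memory sample, the comma delimiter, and the helper name `drop_bad_dates` are made up, and only two of the three columns are shown. The idea is to read the date columns as strings, coerce anything that is not a valid %Y%m%d date to NaT, and drop those rows.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file (delimiter assumed to be a comma).
ROWS = [
    "Pro_3FechaAprobacion,Pro_3FechaDesembolso",
    "20090401,20090402",
    "200904XX,20090403",   # bad day in the first column
    "20090405, ",          # blank value in the second column
    "20090406,20090407",
]
SAMPLE = "\n".join(ROWS)

DATE_COLS = ["Pro_3FechaAprobacion", "Pro_3FechaDesembolso"]


def drop_bad_dates(frame, cols=DATE_COLS):
    """Parse the given columns as %Y%m%d dates and drop rows that fail to parse."""
    out = frame.copy()
    for col in cols:
        # errors="coerce" turns unparseable values (e.g. '200904XX', '') into NaT.
        out[col] = pd.to_datetime(
            out[col].str.strip(), format="%Y%m%d", errors="coerce"
        )
    return out.dropna(subset=cols)


# Read the date columns as strings so no integer conversion is attempted.
raw = pd.read_csv(io.StringIO(SAMPLE), dtype={c: "object" for c in DATE_COLS})
clean = drop_bad_dates(raw)
print(clean)
```

My guess is that something similar might work in dask by passing the same `dtype` mapping to `dd.read_csv` and applying the helper to each partition with `map_partitions`, but I have not verified that on the full file.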