0

First, I already run through several question and answers here about my problem (Error tokenizing data) but I haven't got any working answer because I think my issue is not the same with the one I read.

I have a tab separated data (saved as .xls file) and be able to read thru pandas read_csv but encountered error on specific line during parsing. The data consist of ~58K rows with blank row every other row.

Sample if opened in excel: enter image description here

When reading the file through read_csv:

df = pd.read_csv('8176.xls', header=None, sep='\t', encoding='cp1252')

And this give an error: ParserError: Error tokenizing data. C error: Expected 12 fields in line 13245, saw 13

I read several thread that this could be an header error or a separator error.

I was able to locate the row/line which is encountering error through running my read_csv with nrows and it was in line 6623 of the data.

df = pd.read_csv('8176.xls', header=None, sep='\t', encoding='cp1252', nrows=6623)

Through readlines, I think this is the one causing the error:

There is unwanted tab delimiter (\t) inserted on this line: enter image description here

The same issue when opening the file in excel, the data in that cell is splited and shifted by 1 column. enter image description here

But when opening the file in notepad, it is only a space character not a Tab. enter image description here

Is there any work around to eliminate the unwanted \t?

Or any encoding to read this as space only?

I already tried different encoding but also encountered errors.

Thank you so much.

petezurich
  • 9,280
  • 9
  • 43
  • 57
Ken eXs
  • 51
  • 7
  • Can you post a link to the file in the question? – Сергей Кох Sep 26 '22 at 17:57
  • There are two more functions in pandas for working with tables in txt files - pd.read_table and pd.read_fwf. Maybe what will help? – Сергей Кох Sep 27 '22 at 10:16
  • I'll check the pd.read_table and pd.read_fwf. Here is the sample of file i'm trying to open using read_csv. `https://gofile.io/d/WaOmGF`. In my previous project, I also dealt with .xls file with html formatting that's why I used pd.read_html. – Ken eXs Sep 27 '22 at 15:01
  • While a workaround, but I succeeded. Saved the file as tab-delimited txt despite warnings and applied df = pd.read_table("8176.txt", encoding='cp1252'). Will suit you? – Сергей Кох Sep 27 '22 at 16:37
  • One last thing, one of my objectives here is to eliminate user intervention. Can i do it in python, the opening and saving the file as tab-delimited txt by using `open()` and `save()`? – Ken eXs Sep 27 '22 at 22:58

1 Answers1

0

You can fix this programmatically during read_csv. Check this answer out:

Starting with pandas 1.4.0, read_csv() delivers capability that allows you to handle these situations in a more graceful and intelligent fashion by allowing a callable to be assigned to on_bad_lines=.

Pandas dataframe read_csv on bad data

jch
  • 3,600
  • 1
  • 15
  • 17