Error tokenizing data > How to removed extra \t in a row or line in a data(CSV)

Question

First, I already run through several question and answers here about my problem (Error tokenizing data) but I haven't got any working answer because I think my issue is not the same with the one I read.

I have a tab separated data (saved as .xls file) and be able to read thru pandas read_csv but encountered error on specific line during parsing. The data consist of ~58K rows with blank row every other row.

Sample if opened in excel:

When reading the file through read_csv:

df = pd.read_csv('8176.xls', header=None, sep='\t', encoding='cp1252')

And this give an error: ParserError: Error tokenizing data. C error: Expected 12 fields in line 13245, saw 13

I read several thread that this could be an header error or a separator error.

I was able to locate the row/line which is encountering error through running my read_csv with nrows and it was in line 6623 of the data.

df = pd.read_csv('8176.xls', header=None, sep='\t', encoding='cp1252', nrows=6623)

Through readlines, I think this is the one causing the error:

There is unwanted tab delimiter (\t) inserted on this line:

The same issue when opening the file in excel, the data in that cell is splited and shifted by 1 column.

But when opening the file in notepad, it is only a space character not a Tab.

Is there any work around to eliminate the unwanted \t?

Or any encoding to read this as space only?

I already tried different encoding but also encountered errors.

Thank you so much.

There are two more functions in pandas for working with tables in txt files - pd.read_table and pd.read_fwf. Maybe what will help? — Сергей Кох, Sep 27 '22 at 10:16
I'll check the pd.read_table and pd.read_fwf. Here is the sample of file i'm trying to open using read_csv. `https://gofile.io/d/WaOmGF`. In my previous project, I also dealt with .xls file with html formatting that's why I used pd.read_html. — Ken eXs, Sep 27 '22 at 15:01
While a workaround, but I succeeded. Saved the file as tab-delimited txt despite warnings and applied df = pd.read_table("8176.txt", encoding='cp1252'). Will suit you? — Сергей Кох, Sep 27 '22 at 16:37
One last thing, one of my objectives here is to eliminate user intervention. Can i do it in python, the opening and saving the file as tab-delimited txt by using `open()` and `save()`? — Ken eXs, Sep 27 '22 at 22:58

score 0 · Answer 1 · answered Sep 25 '22 at 17:40

0

You can fix this programmatically during read_csv. Check this answer out:

Starting with pandas 1.4.0, read_csv() delivers capability that allows you to handle these situations in a more graceful and intelligent fashion by allowing a callable to be assigned to on_bad_lines=.

Pandas dataframe read_csv on bad data

answered Sep 25 '22 at 17:40

jch

3,600
1
15
17

Thank you, I will try to learn this one. I'll get back on you. – Ken eXs Sep 26 '22 at 13:56

Error tokenizing data > How to removed extra \t in a row or line in a data(CSV)

1 Answers1