Debugging CSV file

Question

I'm having an issue with a CSV file. My code runs perfectly with the old file, but I recently updated the file to include more websites for my script to scrape, and now the code fails with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 5266: invalid start byte

I'm sure the issue lies with the CSV file, but I'm not sure how to find the character or line that's causing it.

Does anyone have any suggestions on how to find 0x83 position 5266?

Thanks and best,

score 0 · Answer 1 · answered Aug 06 '21 at 08:10


In Notepad++ you can see the cursor position in the status bar at the bottom of the window.

You can also change which encoding to view the text in (if it appears corrupt) and convert to utf-8 and save to fix the encoding.
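If you would rather locate the byte from Python, here is a minimal sketch that scans the raw bytes and reports every place where UTF-8 decoding fails, along with the line it falls on. The function name `find_invalid_utf8` and the file name `websites.csv` are made up for illustration, and the sample data stands in for the real CSV. Note that the position in the error message may be relative to an internal read buffer rather than the start of the file, so scanning the whole file is probably more reliable than jumping straight to offset 5266.

```python
from typing import Iterator, Tuple


def find_invalid_utf8(data: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (byte_offset, line_number, byte_value) for each spot where UTF-8 decoding fails."""
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            return  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            offset = start + err.start  # err.start is relative to the slice
            line_number = data.count(b"\n", 0, offset) + 1  # 1-based line number
            yield offset, line_number, data[offset]
            # Resume scanning after the bad sequence.
            start += max(err.end, err.start + 1)


# In-memory stand-in for the CSV. For a real file, read it in binary mode, e.g.
# data = open("websites.csv", "rb").read()   # "websites.csv" is a hypothetical name
sample = b"name,url\nsite,http://a\ncaf\x83,http://b\n"

for offset, line_number, value in find_invalid_utf8(sample):
    print(f"byte 0x{value:02x} at offset {offset}, line {line_number}")
```

Once you know which encoding the file was really saved in, passing it explicitly when opening the file (for example `open(path, encoding="cp1252")`, assuming it turns out to be Windows-1252) should avoid the error without editing the file.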


Mabbus


And to expand on the UTF-8 encoding: ASCII characters (Unicode code points 0-127) use a single byte. All other Unicode characters use multiple bytes, and the leading bits of the first byte say how many bytes the character occupies (up to 4): 110xxxxx means two bytes, 1110xxxx means three, and 11110xxx means four. 10xxxxxx marks a continuation byte, i.e. one that is not the first byte of a character. 0x83 is 131 in decimal and 10000011 in binary, so it is a continuation byte. It is only valid when it follows a lead byte (11xxxxxx) or another continuation byte of the same multi-byte character; here the decoder hit it where a new character should start, hence "invalid start byte". The file was perhaps saved in a local single-byte encoding, where 131 corresponds to some special character (for example ƒ in Windows-1252 or â in code page 437). – Mabbus Aug 06 '21 at 08:41
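The bit patterns described in this comment can be checked with a few lines of Python. This is only a sketch: `classify_utf8_byte` is a hypothetical helper, and the two code pages at the end are guesses at what the file might have been saved in, not a diagnosis.

```python
def classify_utf8_byte(b: int) -> str:
    """Describe the role a single byte value plays in UTF-8, based on its leading bits."""
    if b < 0x80:
        return "single-byte (ASCII)"             # 0xxxxxxx
    if b < 0xC0:
        return "continuation byte"               # 10xxxxxx
    if b < 0xE0:
        return "lead byte of a 2-byte sequence"  # 110xxxxx
    if b < 0xF0:
        return "lead byte of a 3-byte sequence"  # 1110xxxx
    if b < 0xF8:
        return "lead byte of a 4-byte sequence"  # 11110xxx
    return "never valid in UTF-8"
    # Note: this looks at the bit pattern only; a few values that match a
    # lead-byte pattern (0xC0, 0xC1, 0xF5-0xF7) are still rejected by decoders.


bad = 0x83
print(format(bad, "08b"), "->", classify_utf8_byte(bad))

# What the same byte would be in two single-byte encodings (candidates only):
for encoding in ("cp1252", "cp437"):
    print(encoding, "->", bytes([bad]).decode(encoding))
```

If one of the candidate encodings turns the suspicious bytes into characters that make sense in context, that is a reasonable hint about how the file was saved.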
