A UTF-8 encoded file produces UnicodeDecodeError during parsing

Question

I'm trying to reformat a text file so that I can upload it to a pipeline (QIIME2). I tested my script on the first few lines of the .txt file (which is actually tab-separated), and the conversion was successful. However, when I run the script on the whole file, I get an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte

I have checked that the file encoding is reported as UTF-8, so I am not sure where the problem is coming from.

$ file filename.txt
filename: UTF-8 Unicode text, with very long lines, with CRLF line terminator

I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.

I have tried to force encode it using:

$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt

However, the error produced is:

iconv: illegal input sequence at position 152683

My understanding is that whatever character occurs at that position cannot be read or converted as UTF-8, but then I am not sure why the file is reported as UTF-8 encoded.

I am running this on Linux, and the data itself are sequence information from the BOLD database (if anyone else has run into similar problems when trying to convert this into a format appropriate for QIIME2).

Please show the relevant code and state the exact problem or error. Please include the data. Also see [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). — jww, May 21 '19 at 20:21

score 1 · Answer 1 · answered May 21 '19 at 20:45

The file command is wrong here. It doesn't read the entire file; it bases its guess on a sample of it. I don't have a source ref for this, but file is so fast on huge files that there's no other explanation.
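Since a sample-based guess can miss a problem far into the file, one way to check is to decode every byte yourself. The following is a minimal sketch using only the Python standard library; the function name find_invalid_utf8 is made up for illustration, and it assumes the file is small enough to read into memory in one go.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte_value) for each place where UTF-8 decoding fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing invalid is left.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            offset = pos + err.start
            bad.append((offset, data[offset]))
            # Skip the invalid span and keep scanning.
            pos += err.end
    return bad


# Small in-memory sample: valid UTF-8, one stray 0x96 byte, valid UTF-8 again.
sample = "abc".encode("utf-8") + b"\x96" + "\u00e9".encode("utf-8")
problems = find_invalid_utf8(sample)

# For a real file (path is an assumption):
# with open("filename.txt", "rb") as fh:
#     problems = find_invalid_utf8(fh.read())
```

The offsets it reports are byte positions, so they can be compared directly with the position printed by iconv.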

I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.

But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
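To make the CP1252 guess concrete, here is a small sketch of what happens to the single byte 0x96 under both encodings. In binary it is 10010110, which has the form of a UTF-8 continuation byte and so can never start a character; in CP1252 the same byte is the en dash. Whether your file's stray bytes really are CP1252 is still an assumption.

```python
raw = b"\x96"

# As UTF-8, the byte is rejected with the same reason shown in the question.
try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as err:
    reason = err.reason

# As CP1252 (an assumption about the source encoding), it is an en dash, U+2013.
as_cp1252 = raw.decode("cp1252")

# Re-encoded properly as UTF-8, that en dash takes three bytes.
as_utf8 = as_cp1252.encode("utf-8")
```

An en dash is also easy to overlook next to an ordinary hyphen, which would fit with not spotting anything unusual by eye.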

This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.

For text with broken encoding, you can use the third-party Python library ftfy ("fixes text for you"). It will cut your text at every newline character and try to find (guess) the right encoding for each portion. It doesn't always magically do the right thing, but it's pretty good.
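If you would rather not add a dependency, the same per-line idea can be sketched with the standard library alone. This is not ftfy's API or algorithm, just a rough illustration: the function name decode_mixed is made up, and CP1252 as the fallback is an assumption taken from the guess above.

```python
def decode_mixed(data: bytes, fallback: str = "cp1252") -> str:
    """Decode line by line: UTF-8 first, then a fallback encoding per line."""
    lines = []
    # keepends=True preserves the original CRLF or LF terminators.
    for raw_line in data.splitlines(keepends=True):
        try:
            lines.append(raw_line.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: lines that are not valid UTF-8 are in the fallback
            # encoding. Bytes undefined there become U+FFFD instead of raising.
            lines.append(raw_line.decode(fallback, errors="replace"))
    return "".join(lines)


# One valid UTF-8 line followed by one line containing a CP1252 en dash.
mixed = "caf\u00e9\tA\r\n".encode("utf-8") + b"x \x96 y\r\n"
fixed = decode_mixed(mixed)
```

The result is an ordinary str, so writing it back out with encoding="utf-8" gives a file in a single encoding. A line that mixes both encodings would still come out wrong, since the fallback is applied to the whole line.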

To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).

Hey! Thanks so much for your advice, I decided to try and grep for non UTF8 characters, and I found that it was just way down in my file, and it was exactly what you said was the problem! — A. Alex, May 22 '19 at 05:47
