I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2). I tested my conversion script on the first few lines of the file (a tab-separated .txt file), and the conversion was successful. However, when I run the script on the whole file, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
The file command reports the encoding as UTF-8, so I am not sure where the problem is coming from:
$ file filename.txt
filename: UTF-8 Unicode text, with very long lines, with CRLF line terminator
I have also looked at some of the lines associated with the error, and I cannot spot any unusual characters by eye.
I have also tried to force a re-encoding with iconv:
$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
My understanding is that whatever byte occurs at that position cannot be decoded as UTF-8, but then I am not sure why the file is reported as UTF-8 encoded.
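To narrow this down, a sketch like the following could list every byte that fails to decode as UTF-8, together with a little surrounding context. The function names (find_invalid_utf8, report) are my own and purely illustrative. The cp1252 preview is only a guess on my part: byte 0x96 is an en dash in Windows-1252, but I have not confirmed that this is what the file actually contains.

```python
def find_invalid_utf8(data, context=20):
    """Yield (offset, bad_bytes, surrounding_bytes) for each undecodable run.

    Simple sketch: it re-slices the buffer after every failure, which is
    fine for a handful of bad bytes but slow if there are very many.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded
            start = pos + err.start
            end = pos + err.end
            around = data[max(0, start - context):end + context]
            yield start, data[start:end], around
            pos = end  # resume just after the bad bytes


def report(path):
    """Print each bad byte offset, its hex value, and nearby bytes."""
    with open(path, "rb") as fh:  # binary mode, so nothing is decoded yet
        data = fh.read()
    for offset, bad, around in find_invalid_utf8(data):
        print(offset, bad.hex(), around)
        # Assumption: preview the context as Windows-1252 to see whether
        # the stray byte makes sense there (0x96 would show as an en dash).
        print("  as cp1252:", around.decode("cp1252", errors="replace"))
```

Calling report("filename.txt") should print offsets that can be compared against the position iconv complains about. It only shows where the bad bytes are; it does not modify the file.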
I am running this on Linux, and the data are sequence information from the BOLD database (in case anyone else has run into similar problems when converting BOLD data into a format appropriate for QIIME2).