When reading a UTF-8 text file in Python, you may encounter an invalid UTF-8 byte. The obvious next step is to find the line number containing that byte, but this will probably fail. The code below illustrates the problem.

Step 1: Create a file containing an invalid UTF-8 byte (a1 hex = 161 decimal):

filename=r"D:\wrong_utf8.txt"
longstring = "test just_a_text"*10
with open(filename, "wb") as f:
    for lineno in range(1,100):
        if lineno==85:
            f.write(f"{longstring}\terrrocharacter->".encode('utf-8')+bytes.fromhex('a1')+"\r\n".encode('utf-8'))
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))

Step 2: Read the file and catch the error:

print("First pass, regular Python textline read.")
with open(filename, "r",encoding='utf8') as f:
    lineno=0
    while True:
        try:
            lineno+=1
            line=f.readline()
            if not line:
                break
            print(lineno)
        except UnicodeDecodeError:
            print (f"UnicodeDecodeError at line {lineno}\n")
            break

It prints: UnicodeDecodeError at line 50

I would expect the error line to be line 85, but line number 50 is printed! As a result, the customer who sent us the file was unable to find the illegal character. I tried additional parameters to the open statement (including buffering) but could not get the right error line number.

Note: if you shorten longstring sufficiently, the problem goes away, so it probably has to do with Python's internal buffering.

I succeeded by using the following code to find the error line:

print("Second pass, Python byteline read.")
with open(filename,'rb') as f:
    lineno=0
    while True:
        try:
            lineno+=1
            line = f.readline()
            if not line:
                break
            lineutf8=line.decode('utf8')
            print(lineno)
        except UnicodeDecodeError: #Exception as e:
            mybytelist=line.split(b'\t')
            for index,field in enumerate(mybytelist):
                try:
                    fieldutf8=field.decode('utf8')
                except UnicodeDecodeError:
                    print(f'UnicodeDecodeError in line {lineno}, field {index+1}, offending field: {field}!')
                    break
            break

Now it prints the right lineno: UnicodeDecodeError in line 85, field 2, offending field: b'errrocharacter->\xa1\r\n'!

Is this the Pythonic way of finding the error line? It works all right, but I have the feeling that a better method should exist, one that does not require reading the file twice and/or using a binary read.

Cornelis

2 Answers

The actual cause is indeed the way Python processes text files internally. They are read in chunks, each chunk is decoded according to the specified encoding, and then, if you use readline or iterate over the file object, the decoded buffer is split into lines, which are returned one at a time.

You can see evidence of this by examining the UnicodeDecodeError object at the time of the error:

    ....
    except UnicodeDecodeError as e:
        print (f"UnicodeDecodeError at line {lineno}\n")
        print(repr(e)) # or err = e to save the object and examine it later
        break

With your example data, you can find that Python was trying to decode a buffer of 8149 bytes, and that the offending character occurs at position 5836 in that buffer.
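
To make that inspection reproducible, here is a minimal sketch that recreates the sample file in a temporary directory and prints what the exception carries. The helper names write_sample and inspect_decode_error are made up for this illustration; object, start and end are standard attributes of UnicodeDecodeError. The exact chunk size and offset may differ between Python versions, so the sketch prints them rather than assuming them.

```python
import os
import tempfile


def write_sample(path):
    """Write 99 CRLF-terminated lines; line 85 ends with the invalid byte 0xa1."""
    long_text = "test just_a_text" * 10
    with open(path, "wb") as f:
        for lineno in range(1, 100):
            if lineno == 85:
                f.write(f"{long_text}\terrrocharacter->".encode("utf-8") + b"\xa1\r\n")
            else:
                f.write(f"{long_text}\t{lineno}\r\n".encode("utf-8"))


def inspect_decode_error(path):
    """Read in text mode; return (lines returned before the failure, exception or None)."""
    lines_read = 0
    with open(path, "r", encoding="utf-8") as f:
        try:
            for _ in f:
                lines_read += 1
        except UnicodeDecodeError as e:
            return lines_read, e
    return lines_read, None


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "wrong_utf8.txt")
    write_sample(sample)
    lines_read, err = inspect_decode_error(sample)
    print(f"lines returned before the error: {lines_read}")
    # err.object is the chunk of bytes being decoded, not the current line
    print(f"chunk length: {len(err.object)}, error offset in chunk: {err.start}")
    print(f"offending bytes: {err.object[err.start:err.end]!r}")
```

The offset is relative to the chunk, not to the file or to the line, which is why it cannot be turned into a line number directly.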

This processing is deep inside the Python io library, because text files have to be buffered and the binary buffer is decoded before being split into lines. So IMHO little can be done here, and the best way is probably your second attempt: read the file as a binary file and decode the lines one at a time.
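
A more compact form of that binary read could look like the sketch below. It is only an illustration: iter_decode_errors is a hypothetical name, the tab separator is taken from the sample data in the question, and the generator reports every failing field instead of stopping at the first one.

```python
import os
import tempfile


def iter_decode_errors(path, sep=b"\t", encoding="utf-8"):
    """Yield (line_no, field_no, raw_field) for every field that fails to decode."""
    with open(path, "rb") as f:
        # Iterating a binary file splits on b"\n", independent of any decoding
        for line_no, raw_line in enumerate(f, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError:
                # Only lines that fail are split into fields and re-checked
                fields = raw_line.rstrip(b"\r\n").split(sep)
                for field_no, field in enumerate(fields, start=1):
                    try:
                        field.decode(encoding)
                    except UnicodeDecodeError:
                        yield line_no, field_no, field


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as fh:
        fh.write(b"good\t1\r\ngood\terror->\xa1\r\ngood\t3\r\n")
    for line_no, field_no, raw in iter_decode_errors(sample):
        print(f"line {line_no}, field {field_no}: {raw!r}")
```

Splitting the raw bytes on a tab is safe for UTF-8 because the byte 0x09 never occurs inside a multi-byte sequence.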


Alternatively, you could use errors='replace' to replace any offending byte with a REPLACEMENT CHARACTER (U+FFFD). But then, you would no longer test for an error, but search for that character in the line:

with open(filename, "r",encoding='utf8', errors='replace') as f:
    lineno=0
    while True:
        lineno+=1
        line=f.readline()
        if not line:
            break
        if chr(0xfffd) in line:
            print (f"UnicodeDecodeError at line {lineno}\n")
            break
        print(lineno)

This one also gives as expected:

...
80
81
82
83
84
UnicodeDecodeError at line 85
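
If the value of the offending byte is needed as well, a variation on the same idea is errors='surrogateescape'. That standard error handler maps each undecodable byte 0xNN to the lone surrogate U+DCNN instead of U+FFFD, so the byte can be recovered from the decoded text. A minimal sketch, assuming a hypothetical helper name find_bad_bytes and a small throwaway sample file:

```python
import os
import tempfile


def find_bad_bytes(path):
    """Return [(line_no, column, byte_value), ...] for every undecodable byte."""
    hits = []
    with open(path, "r", encoding="utf-8", errors="surrogateescape") as f:
        for line_no, line in enumerate(f, start=1):
            for col, ch in enumerate(line, start=1):
                # surrogateescape turns a bad byte 0xNN into U+DC00 + 0xNN
                if 0xDC80 <= ord(ch) <= 0xDCFF:
                    hits.append((line_no, col, ord(ch) - 0xDC00))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as fh:
        fh.write(b"good\r\nalso good\r\nbad->\xa1<-\r\n")
    for line_no, col, value in find_bad_bytes(sample):
        print(f"line {line_no}, column {col}: byte 0x{value:02x}")
```

Because no exception is raised, the line counter stays in step with the file, and valid UTF-8 never decodes to a character in that surrogate range, so there should be no false positives.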
Serge Ballesta
  • The problem with errors='replace' is that the file is sent to a classification site that does not accept 0xfffd characters in the file. It is therefore important that the user replaces the illegal character with the right one, and for that he/she needs the error line number and the error field. – Cornelis Nov 05 '22 at 15:02
  • @Cornelis: I have shown how to get the line number of the error... What do you mean exactly by *replace with the right one*? – Serge Ballesta Nov 05 '22 at 16:29
  • Hi Serge, I forgot yesterday to thank you for looking into my question. The user should preferably see the actual illegal character (or its hex value), because that character needs to be replaced by the right character before the output is sent to the classification site. Of course your method also works fine and is easier to apply if the actual illegal character itself is not relevant. – Cornelis Nov 06 '22 at 07:31

The UnicodeDecodeError has information about the error that can be used to improve the reporting of the error.

My proposal would be to decode the whole file in one go. If the content is good then there is no need to iterate around a loop. Especially as reading a binary file doesn't have the concept of lines.

If the decode raises an error, the UnicodeDecodeError carries the start and end positions of the bad content. Looking only at the content before that bad character allows the line to be found efficiently, for example by counting lines with len and splitlines.

If you want to display the bad line, decoding with errors='replace' might be useful, along with the line number from the previous step.

I would also consider raising a custom exception with the new information.

Here is an example:

from pathlib import Path


def create_bad(filename):
    longstring = "test just_a_text" * 10
    with open(filename, "wb") as f:
        for lineno in range(1, 100):
            if lineno == 85:
                f.write(f"{longstring}\terrrocharacter->".encode('utf-8') + bytes.fromhex('a1') + "\r\n".encode('utf-8'))
            else:
                f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))


class BadUnicodeInFile(Exception):
    """Add information about line numbers"""
    pass


def new_read_bad(filename):
    file = Path(filename)
    data = file.read_bytes()
    try:
        file_content = data.decode('utf8')
    except UnicodeDecodeError as err:
        # Line number = newlines before the bad byte + 1
        # (also correct when the bad byte is the first byte of a line)
        bad_line_no = err.object.count(b'\n', 0, err.start) + 1
        bad_line_content = err.object.decode('utf8', 'replace').splitlines()[bad_line_no - 1]
        bad_content = err.object[err.start:err.end]
        raise BadUnicodeInFile(
            f"{filename} has bad content ({bad_content}) on: line number {bad_line_no}\n"
            f"\t{bad_line_content}")
    return file_content


if __name__ == '__main__':
    create_bad("/tmp/wrong_utf8.txt")
    new_read_bad("/tmp/wrong_utf8.txt")

This gave the following output:

Traceback (most recent call last):
  File "/home/user1/stack_overflow/wrong_utf8.py", line 39, in new_read_bad
    file_content = data.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 14028: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/stack_overflow/wrong_utf8.py", line 52, in <module>
    new_read_bad("/tmp/wrong_utf8.txt")
  File "/home/user1/stack_overflow/wrong_utf8.py", line 44, in new_read_bad
    raise BadUnicodeInFile(
__main__.BadUnicodeInFile: /tmp/wrong_utf8.txt has bad content (b'\xa1') on: line number 85
    test just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_text    errrocharacter->�

ukBaz
  • Thanks for your input. I never realized that the error object contains all the information to find the error line. However, I just discovered multiple encoding errors in another file. In that case a binary error detection read is not so bad I think, for then all the errors can be presented to the user at once (by accumulating the errors rather than breaking at the end of the method). – Cornelis Nov 07 '22 at 08:52
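
For the multiple-error case raised in the comment above, the decode-in-one-go approach can probably be extended by resuming the decode just after each reported error. The sketch below is one possible way to do that, not part of the answer itself: find_all_decode_errors is a hypothetical name, and slicing the remaining bytes on every pass is chosen for brevity rather than speed.

```python
def find_all_decode_errors(data, encoding="utf-8"):
    """Return [(line_no, bad_bytes), ...] for every undecodable span in data."""
    errors = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest of the data is clean
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice that was decoded
            start, end = pos + err.start, pos + err.end
            line_no = data.count(b"\n", 0, start) + 1
            errors.append((line_no, data[start:end]))
            pos = end  # resume right after the bad span
    return errors


sample = b"ok\nbad\xa1\nok\n\xffx\n"
for line_no, bad in find_all_decode_errors(sample):
    print(f"line {line_no}: {bad!r}")
```

With a real file, the bytes could come from Path(filename).read_bytes() as in the answer, and the collected list could be attached to a single custom exception so that all errors are presented to the user at once.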