When reading a UTF-8 text file in Python you may encounter an invalid UTF-8 byte sequence. The next step is usually to find the line (number) that contains the invalid byte, but this will probably fail. The code below illustrates the problem.
Step 1: Create a file containing an invalid UTF-8 byte (hex a1 = decimal 161)
filename = r"D:\wrong_utf8.txt"
longstring = "test just_a_text" * 10
with open(filename, "wb") as f:
    for lineno in range(1, 100):
        if lineno == 85:
            f.write(f"{longstring}\terrrocharacter->".encode('utf-8') + bytes.fromhex('a1') + "\r\n".encode('utf-8'))
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))
Step 2: Read the file and catch the error:
print("First pass, regular Python textline read.")
with open(filename, "r", encoding='utf8') as f:
    lineno = 0
    while True:
        try:
            lineno += 1
            line = f.readline()
            if not line:
                break
            print(lineno)
        except UnicodeDecodeError:
            print(f"UnicodeDecodeError at line {lineno}\n")
            break
After printing the line numbers 1 to 49, it prints: UnicodeDecodeError at line 50
I would expect the error to be reported at line 85, yet line 50 is printed! As a result, the customer who sent us the file was unable to find the invalid byte. I tried additional parameters on the open call (including buffering) but could not get the correct error line number.
Note: if you shorten longstring sufficiently, the problem goes away. So the problem probably has to do with Python's internal buffering.
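To make the buffering hypothesis concrete, here is a rough sketch of the arithmetic. It assumes that the text layer decodes the file in chunks of 8192 bytes; that figure is an assumption about a CPython implementation detail and may differ between versions. The helper names build_lines and first_line_needing_chunk are made up for this illustration. The idea is that the error surfaces as soon as a readline call needs the chunk that contains the bad byte, which is the first line that extends past the previous chunk boundary.

```python
CHUNK_SIZE = 8192  # assumption: bytes decoded per chunk by the text layer

longstring = "test just_a_text" * 10


def build_lines():
    """Rebuild the 99 byte lines from Step 1 in memory (no file needed)."""
    lines = []
    for lineno in range(1, 100):
        if lineno == 85:
            lines.append(
                f"{longstring}\terrrocharacter->".encode("utf-8") + b"\xa1" + b"\r\n"
            )
        else:
            lines.append(f"{longstring}\t{lineno}\r\n".encode("utf-8"))
    return lines


def first_line_needing_chunk(lines, bad_offset, chunk_size=CHUNK_SIZE):
    """Return the first line whose bytes extend into the chunk holding bad_offset."""
    bad_chunk_start = (bad_offset // chunk_size) * chunk_size
    end = 0
    for lineno, raw in enumerate(lines, start=1):
        end += len(raw)
        if end > bad_chunk_start:
            return lineno
    return None


lines = build_lines()
data = b"".join(lines)
bad_offset = data.index(b"\xa1")  # byte offset of the invalid byte in the file
print(first_line_needing_chunk(lines, bad_offset))  # prints 50
```

Under that assumption the bad byte sits in the second chunk, and line 50 is the first line that straddles the boundary between the first and second chunk, which would match the reported line number.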
I succeeded by using the following code to find the error line:
print("Second pass, Python byteline read.")
with open(filename, 'rb') as f:
    lineno = 0
    while True:
        try:
            lineno += 1
            line = f.readline()
            if not line:
                break
            lineutf8 = line.decode('utf8')
            print(lineno)
        except UnicodeDecodeError:  # Exception as e:
            mybytelist = line.split(b'\t')
            for index, field in enumerate(mybytelist):
                try:
                    fieldutf8 = field.decode('utf8')
                except UnicodeDecodeError:
                    print(f'UnicodeDecodeError in line {lineno}, field {index+1}, offending field: {field}!')
                    break
            break
Now it prints the correct line number: UnicodeDecodeError in line 85, field 2, offending field: b'errrocharacter->\xa1\r\n'!
Is this the Pythonic way of finding the error line? It works all right, but I have the feeling that a better method should exist, one that does not require reading the file twice and/or using a binary read.
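One possible direction, offered as a sketch rather than a definitive answer: keep the text-mode read but open the file with errors="surrogateescape". With that standard error handler, each undecodable byte 0xNN is turned into the lone surrogate U+DCNN instead of raising, so the affected line can be found in a single pass. The function name find_undecodable_lines and the temporary demo file are assumptions made for this example.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, column, byte_value) for every byte that failed to decode."""
    hits = []
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN
    with open(path, "r", encoding=encoding, errors="surrogateescape") as f:
        for lineno, line in enumerate(f, start=1):
            for col, ch in enumerate(line, start=1):
                if "\udc80" <= ch <= "\udcff":
                    hits.append((lineno, col, ord(ch) - 0xDC00))
    return hits


# Demo: recreate the file from Step 1 in a temporary location.
longstring = "test just_a_text" * 10
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    for lineno in range(1, 100):
        if lineno == 85:
            f.write(f"{longstring}\terrrocharacter->".encode("utf-8") + b"\xa1" + b"\r\n")
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode("utf-8"))

hits = find_undecodable_lines(demo_path)
os.remove(demo_path)
for lineno, col, byte in hits:
    print(f"Undecodable byte 0x{byte:02x} in line {lineno}, column {col}")
```

Because decoding no longer raises, the chunk in which the decoder happens to meet the bad byte no longer matters; the line number comes from ordinary line iteration. The column is counted in characters, not bytes.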