'utf-8' codec can't decode byte 0xe2 : invalid continuation byte error

Question

I am trying to read all PDF files in a folder and search them for a number using a regular expression. On inspection, the charset for the PDFs is 'UTF-8'.

My code throws this error:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I tried reading in binary mode and with Latin-1 encoding, but the output is full of special characters, so the search finds nothing.

import os
import re
import pandas as pd
download_file_path = "C:\\Users\\...\\..\\"
for file_name in os.listdir(download_file_path):
    try:
        with open(download_file_path + file_name, 'r', encoding="UTF-8") as f:
            s = f.read()
            re_api = re.compile(r"API No\.:\n(.*)")
            api = re_api.search(s).group(1).split('"')[0].strip()
            print(api)
    except Exception as e:
        print(e)

Expecting to find API number from PDF files

'NoneType' object has no attribute 'group', I get this error when I tried it — Prat, Jun 05 '19 at 03:29
*"On inspection, the charset for PDFs is 'UTF-8'."* - Nope, pdf is a binary format, usually containing much compressed data. And even uncompressed string data in it can occur in a mix of encodings, hardly ever utf-8. — mkl, Jun 05 '19 at 04:43

ASHu2 · Answer 1 · 2019-06-18T10:42:28.477

PDF files are stored as bytes. Therefore to read or write a PDF file you need to use rb or wb.

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode())

The error 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte might occur because of your editor or, more generally, because the PDF is not UTF-8 encoded.

Therefore use:

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode('latin-1'))  # or any other encoding that suits the data

If your editor's console cannot display the decoded characters, you may also see no output.

Note: you can't pass the encoding parameter when opening a file in rb mode, so you have to decode after reading the file.
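
To make the point about binary mode concrete, here is a small sketch using a throwaway temporary file (its contents are made up for illustration and are not a real PDF). Python rejects encoding together with a binary mode, so decoding has to happen on the bytes afterwards:

```python
import os
import tempfile

# Create a small stand-in file; the bytes are illustrative, not a real PDF.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\nAPI No.:\n12345\n")
    path = tmp.name

try:
    # Passing encoding= together with 'rb' is rejected by open().
    try:
        open(path, "rb", encoding="utf-8")
        rejected = False
    except ValueError:
        rejected = True

    # Read the raw bytes first, then decode explicitly.
    with open(path, "rb") as fopen:
        raw = fopen.read()
    text = raw.decode("latin-1")  # never fails, but may produce mojibake
finally:
    os.remove(path)

print(rejected)
print(text)
```

Latin-1 is used here only because it accepts every byte value; whether the result is readable depends on how the text was actually stored.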

This worked for me. And not even just PDFs, I was getting DecodeError exception for images as well. — Somendra Meena, Oct 11 '22 at 06:14

tripleee · Answer 2 · 2019-06-05T06:52:44.167

When you open a file with open(..., 'r', encoding='utf-8') you are basically guaranteeing that this is a text file containing no bytes which are not UTF-8. But of course, this guarantee cannot hold for a PDF file - it is a binary format which may or may not contain strings in UTF-8. But that's not how you read it.
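
As an illustration of why that guarantee fails: many PDFs start with a version header followed by a comment line of high-bit bytes that marks the file as binary. The sketch below uses a hand-written byte string as a stand-in for the start of such a file (an assumption, not the asker's actual file); decoding it as UTF-8 fails in the same way as the error in the question.

```python
# Hypothetical first bytes of a PDF: a version header, then a comment
# line made of high-bit bytes (0xE2 0xE3 0xCF 0xD3).
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

try:
    header.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xE2 announces a three-byte sequence, but the next byte, 0xE3, is not
    # a valid continuation byte (those lie in the range 0x80-0xBF).
    error = exc

print(error)
```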

If you have access to a library which reads PDF and extracts text strings, you could do

# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
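
Such libraries do exist. Here is a minimal sketch, assuming the third-party pypdf package is installed (pip install pypdf) and that the PDF has an extractable text layer. The helper names extract_pdf_text and find_api_number are made up for this example, and the regular expression assumes the number follows the "API No.:" label as in the question.

```python
import re

# Label, optional whitespace/newline, then the value up to a quote or line end.
API_RE = re.compile(r'API No\.:\s*([^"\n]+)')


def find_api_number(text):
    """Return the value following 'API No.:' in extracted text, or None."""
    match = API_RE.search(text)
    return match.group(1).strip() if match else None


def extract_pdf_text(path):
    """Concatenate the text of every page (requires the pypdf package)."""
    from pypdf import PdfReader  # imported lazily: third-party dependency

    reader = PdfReader(path)
    # Guard against pages that yield no text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# The parsing step can be tried on a plain string without any PDF at hand.
sample = 'Well report\nAPI No.:\n42-123-45678"\nOperator: Example Co'
print(find_api_number(sample))
```

With a real file this would be called as find_api_number(extract_pdf_text('file.pdf')). Returning None when the label is missing avoids the 'NoneType' object has no attribute 'group' error; scanned PDFs without a text layer will simply produce no match.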

More realistically, but in a more pedestrian fashion, you could read the PDF file as a binary file, and look for the encoded text.

with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()
if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()

A crude workaround is to lie to Python about the encoding, and claim that it's actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (so "hëlló" will render as "hÃ«llÃ³" for example). You can extract actual UTF-8 text by converting the text back to bytes and then decoding it with the correct encoding (latintext.encode('latin-1').decode('utf-8')).
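
The round trip described above can be sketched as follows, using the same example string. This only demonstrates the byte-to-code-point mapping; it does not make compressed PDF content searchable.

```python
original = "hëlló"

# Bytes as they would be stored in UTF-8.
utf8_bytes = original.encode("utf-8")

# "Lying" about the encoding: each byte becomes the code point of the same value.
mojibake = utf8_bytes.decode("latin-1")

# Undo the lie: back to the identical bytes, then decode with the real encoding.
recovered = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)
print(recovered)
```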

team meryb · Answer 3 · Sep 08 '21 at 09:06 · score 3

Just switch to a different codec: encoding = 'unicode_escape'

Yuksel CELIK · Answer 4 · Mar 21 '20 at 22:45 · score 0

The problem may be due to your computer name. I got this error in the Python Django framework.

The solution: your computer name must not contain special characters. Please check and change your computer name.
