How to resolve UTF-8 encoding issues in Python strings while extracting data from a PDF?

Asked Jan 26 '22 at 12:25

Active Jan 26 '22 at 13:30

I used PyPDF2 to extract text from a PDF file. This is the code I wrote.

import PyPDF2 as pdf

# Open the PDF in binary mode and print the text of the first page.
with open("file_to_scrape.pdf", "rb") as file:
    pdf_reader = pdf.PdfFileReader(file)
    page_1 = pdf_reader.getPage(0)
    print(page_1.extractText())

This produced the following output:

˜˚Power: An Enabler for Industrialization and Regional Cooperation ˜˚.˜ Introduction

The weird characters in front of "Power" and "Introduction" are supposed to be numbers: 15 and 15.1, to be precise.

I copied them and tried to encode them to UTF-8, but this is what I got:

b'\xcb\x9c\xcb\x9aPower: An Enabler for Industrialization and Regional Cooperation\xcb\x9c\xcb\x9a.\xcb\x9c Introduction'
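
For reference, the encoding step can be reproduced with a minimal sketch like the one below. It assumes the two unexpected characters are the code points implied by the bytes above; the helper name `describe` is illustrative and not part of the original script. It prints each character's code point, its Unicode name, and its UTF-8 bytes.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


# The two characters that appear where the digits 1 and 5 were expected.
for ch in "\u02dc\u02da":
    print(*describe(ch))
# U+02DC SMALL TILDE b'\xcb\x9c'
# U+02DA RING ABOVE b'\xcb\x9a'
```

Both characters are ordinary Unicode code points, so encoding them to UTF-8 only serializes the characters the extractor returned. It does not recover the digits.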

This is what the page looks like. Could someone please help me figure out this issue? My aim is to extract a list of all the figures and headings in the PDF, along with their numbering.

edited Jan 26 '22 at 12:34

asked Jan 26 '22 at 12:25

Sehal Hasan

See this related question: https://stackoverflow.com/questions/4203414/pypdf-unable-to-extract-text-from-some-pages-in-my-pdf and in particular this answer: https://stackoverflow.com/a/4203729/11777096 – NASSIM ADRAO Jan 26 '22 at 13:06

Why are you converting them to a different encoding in the first place? If the PDF extractor can't extract the information then you can't hope to conjure it out of thin air. – tripleee Jan 26 '22 at 13:11
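
As the comment above says, re-encoding cannot recover the digits. If the extractor substitutes the same character for the same digit throughout the document (an assumption that has to be checked against the rendered page), a lookup table could undo the substitution after extraction. The sketch below is a possible workaround under that assumption, not a confirmed fix. The names `GLYPH_TO_DIGIT` and `restore_digits` are hypothetical, and only the two pairs that can be inferred from the question (15 and 15.1) are included.

```python
# Hypothetical lookup table. Only these two pairs can be inferred from the
# question; any other digits would have to be added after comparing the
# extracted text with the rendered page.
GLYPH_TO_DIGIT = str.maketrans({"\u02dc": "1", "\u02da": "5"})


def restore_digits(text: str) -> str:
    """Replace the known substituted characters with the digits they stand for."""
    return text.translate(GLYPH_TO_DIGIT)


extracted = (
    "\u02dc\u02daPower: An Enabler for Industrialization and "
    "Regional Cooperation \u02dc\u02da.\u02dc Introduction"
)
print(restore_digits(extracted))
# 15Power: An Enabler for Industrialization and Regional Cooperation 15.1 Introduction
```

A table like this is only safe if the substituted characters never occur as legitimate text in the document. Otherwise genuine tildes or ring marks would also be turned into digits.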
