
I used PyPDF2 to extract text from a PDF file. This is the code I wrote.

import PyPDF2 as pdf

# Open the PDF in binary mode and print the text of the first page
with open("file_to_scrape.pdf", "rb") as file:
    pdf_reader = pdf.PdfFileReader(file)
    page_1 = pdf_reader.getPage(0)
    print(page_1.extractText())

This produced the following output:

˜˚Power: An Enabler for Industrialization and Regional Cooperation ˜˚.˜ Introduction

The odd characters in front of "Power" and "Introduction" are supposed to be numbers: 15 and 15.1, respectively.

I copied them and tried to encode them to utf-8, but this is what I got.

b'\xcb\x9c\xcb\x9aPower: An Enabler for Industrialization and Regional Cooperation\xcb\x9c\xcb\x9a.\xcb\x9c Introduction'
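To show what those bytes correspond to, here is a small sketch that looks up the two stray characters with the standard `unicodedata` module. The helper name `describe` is made up for illustration. It suggests the extractor returned genuine Unicode characters (a small tilde and a ring above) rather than mis-decoded digits, so encoding them only serializes the same characters to bytes.

```python
import unicodedata

# The two stray characters copied from the extracted text ("˜˚").
garbled = "\u02dc\u02da"


def describe(text):
    """Return (code point, Unicode name) for each character in text.

    Hypothetical helper, used here for illustration only.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for code_point, name in describe(garbled):
    print(code_point, name)

# Encoding to UTF-8 does not recover the digits; it yields the same
# two-byte sequences seen in the output above.
print(garbled.encode("utf-8"))
```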

This is what the page looks like: [screenshot not included]

Could someone please help me figure out this issue? My aim is to extract a list of all figures and headings in the PDF, along with their numbering.
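For context, this is roughly the post-processing I have in mind once the numbers come out correctly. It is only a sketch: the two regular expressions, the function name `find_numbered_items`, and the sample text (including the figure caption) are my own assumptions about how headings and figure captions appear in the extracted text, not something taken from the PDF.

```python
import re

# Assumed patterns: a heading is "<number> <Title>", and a figure caption is
# "Figure <number>: <caption>". Real PDFs may need different patterns.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].*)$")
FIGURE_RE = re.compile(r"^(?:Figure|Fig\.)\s+(\d+(?:\.\d+)*)[:.]?\s*(.*)$")


def find_numbered_items(text):
    """Split extracted text into (number, title) pairs for headings and figures."""
    headings, figures = [], []
    for line in text.splitlines():
        line = line.strip()
        match = FIGURE_RE.match(line)
        if match:
            figures.append((match.group(1), match.group(2)))
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((match.group(1), match.group(2)))
    return headings, figures


# Hypothetical sample of what correctly extracted text might look like.
sample = """15 Power: An Enabler for Industrialization and Regional Cooperation
15.1 Introduction
Figure 15.1: Example caption
Some body text that is not a heading."""

headings, figures = find_numbered_items(sample)
print(headings)
print(figures)
```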

  • See this related question: https://stackoverflow.com/questions/4203414/pypdf-unable-to-extract-text-from-some-pages-in-my-pdf and, in particular, this answer: https://stackoverflow.com/a/4203729/11777096 – NASSIM ADRAO Jan 26 '22 at 13:06
  • Why are you converting them to a different encoding in the first place? If the PDF extractor can't extract the information then you can't hope to conjure it out of thin air. – tripleee Jan 26 '22 at 13:11
