I used PyPDF2 to extract text from a PDF file. This is the code I wrote.
import PyPDF2 as pdf

# Open the PDF in binary mode and extract the text of the first page (index 0).
with open("file_to_scrape.pdf", "rb") as file:
    pdf_reader = pdf.PdfFileReader(file)
    page_1 = pdf_reader.getPage(0)
    print(page_1.extractText())
This produced the following output:
˜˚Power: An Enabler for Industrialization and Regional Cooperation ˜˚.˜ Introduction
The weird characters in front of "Power" and "Introduction" are supposed to be numbers: 15 and 15.1, to be precise.
I copied them and tried to encode them to utf-8, but this is what I got.
b'\xcb\x9c\xcb\x9aPower: An Enabler for Industrialization and Regional Cooperation\xcb\x9c\xcb\x9a.\xcb\x9c Introduction'
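To see which characters those bytes actually are, a minimal sketch like the one below could be used. It decodes the last part of the byte string shown above and looks up each non-ASCII character with the standard unicodedata module. The helper name describe_non_ascii is only illustrative.

```python
import unicodedata

# Tail of the byte string shown above, decoded back to text.
sample = b"\xcb\x9c\xcb\x9a.\xcb\x9c Introduction".decode("utf-8")


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


for ch, code_point, name in describe_non_ascii(sample):
    print(ch, code_point, name)
```

Both characters are spacing diacritics (SMALL TILDE, U+02DC, and RING ABOVE, U+02DA), not digits. The bytes are therefore valid UTF-8, and the problem seems to lie in how the PDF's font maps its glyphs to characters rather than in the encoding step.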
This is what the page looks like. Could someone please help me figure out this issue? My aim is to extract a list of all figures and headings in the PDF, along with their numbering.
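One possible workaround, sketched under a strong assumption: since 15 came out as ˜˚ and 15.1 as ˜˚.˜, the PDF appears to emit ˜ for 1 and ˚ for 5. The names GLYPH_TO_DIGIT and restore_digits below are hypothetical. The mapping covers only the two digits seen on this page, so the other digits would have to be worked out the same way before this could be relied on.

```python
# Assumed mapping, inferred only from this one page: "15" was extracted as "˜˚".
GLYPH_TO_DIGIT = str.maketrans({
    "\u02dc": "1",  # SMALL TILDE appears where a 1 is expected
    "\u02da": "5",  # RING ABOVE appears where a 5 is expected
})


def restore_digits(text):
    """Replace the known substitute characters with the digits they stand for."""
    return text.translate(GLYPH_TO_DIGIT)


# The extracted text, rebuilt from the byte string shown earlier in the post.
extracted = (
    b"\xcb\x9c\xcb\x9aPower: An Enabler for Industrialization and Regional "
    b"Cooperation\xcb\x9c\xcb\x9a.\xcb\x9c Introduction"
).decode("utf-8")

print(restore_digits(extracted))
```

This only patches the symptom after extraction. If legitimate ˜ or ˚ characters occur elsewhere in the document, they would be replaced as well.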