Get text data from a PDF with Python

Question

I am stuck on how to deal with PDFs here. I don't know how to scrape one directly from the web, and when I download one locally the contents are complete nonsense, not the actual text data.

I have tried downloading with requests, but the contents are then just useless.

import PyPDF2
#  textract
import requests
# from nltk.tokenize import word_tokenize
# from nltk.corpus import stopwords


def get_amount(url):
  data = requests.get(url)
  with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
    f.write(data.content)

I am trying to figure out how to get the text data from a PDF. Any suggestions would be greatly appreciated!

What output do you get when you scrape a PDF? "the contents is then just useless" doesn't help. — blackbrandt, Jun 24 '19 at 15:34
@blackbrandt i think it gives me just a binary file or something, whatever is the base of a pdf file — derric-d, Jun 24 '19 at 15:49
``` import PyPDF2; pdf_file = open('sample.pdf'); read_pdf = PyPDF2.PdfFileReader(pdf_file); number_of_pages = read_pdf.getNumPages(); page = read_pdf.getPage(0); page_content = page.extractText(); print page_content; ``` — Ashwin Geet D'Sa, Jun 24 '19 at 15:50
Try to use the code similar to the one above, it may work if the encoding of your file is a suitable one — Ashwin Geet D'Sa, Jun 24 '19 at 15:50
ah i understand, that does work.. i had made some stupid assumptions before. — derric-d, Jun 24 '19 at 16:10

score 1 · Accepted Answer · answered Jun 24 '19 at 16:12

Please modify your code as below:

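import PyPDF2

# PDFs are binary files, so open the download in 'rb' mode
with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    # Visit every page by its index instead of re-reading page 0
    for i in range(number_of_pages):
        page = read_pdf.getPage(i)
        page_content = page.extractText()
        print(page_content)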
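
If you would rather not save the file to disk at all, something along these lines may work. It is only a sketch: `looks_like_pdf` and `extract_text_from_url` are hypothetical helper names, not part of any library, and the code assumes the same legacy PyPDF2 API used above (`PdfFileReader`, `getPage`, `extractText`). PyPDF2 3.0 and later removed those names in favour of `PdfReader`, `reader.pages` and `extract_text()`, so adjust accordingly if you are on a newer release. The header check also explains the "binary nonsense": a PDF is a binary format, so the raw bytes from requests are expected to be unreadable until a PDF library parses them.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    # A PDF file begins with "%PDF-" followed by a version number.
    # An HTML error page or login redirect will not pass this check.
    return data.startswith(b"%PDF-")


def extract_text_from_url(url: str) -> str:
    """Download a PDF and return the text of all its pages (hypothetical helper)."""
    # Third-party imports are kept local so looks_like_pdf() can be used
    # on its own without requests or PyPDF2 installed.
    import requests
    import PyPDF2

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on 404/500 rather than parsing an error page
    if not looks_like_pdf(response.content):
        raise ValueError("the response does not look like a PDF")

    # Wrap the downloaded bytes in a file-like object so nothing touches disk.
    reader = PyPDF2.PdfFileReader(io.BytesIO(response.content))
    return "\n".join(
        reader.getPage(i).extractText() for i in range(reader.getNumPages())
    )
```

You would then call `text = extract_text_from_url(url)` with your own URL. Note that text extraction only returns something useful when the PDF actually contains a text layer; a scanned document made of page images will usually come back empty.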
