How to scrape PDFs using Python; specific content only

Question

I am trying to get data from PDFs available on the site

https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en

For example, take the November 2019 report:

https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf

I need the data on page 12 for corn, and I have to create separate files for ending stocks, exports, etc. I am new to Python and I am not sure how to extract each piece of content separately. If I can figure it out for one month, I can then write a loop over the other months. But I am confused about how to proceed for a single file.

Can someone help me out here, TIA.

If the page sends everything in one PDF, then you will have to download this file and later use other modules to get data from the PDF. But these modules have nothing to do with 'scraping'. They are described by the words `edit` or `extract`. — furas, Dec 01 '19 at 22:56
I checked this page and I see links to files txt, xls, xml - it would be easier to get txt file and work with text - eventually with xml or xls. — furas, Dec 01 '19 at 23:00
Actually they do not have text files for all the years, that's why I was thinking to extract from PDFs — Camilia, Dec 01 '19 at 23:05
using `requests` or `urllib` you can get HTML from server, using `BeautifulSoup` you can find links to PDF in HTML, using these links with `requests` or `urllib` you can download PDF. Later you would have to use other tools to work with PDF. There are modules `PDFMiner`, `PyPDF2` to work with PDF in Python but I don't have experience with this. — furas, Dec 01 '19 at 23:27

score 7 · Answer 1 · answered Dec 02 '19 at 00:03

Here is a small example using PyPDF2, requests, and BeautifulSoup. Please check the comments in the code. It covers the first block of releases; if you need more, change the URL in the code.

# You need to install:
# pip install PyPDF2              -> read the PDF and extract its text
# pip install requests            -> download the HTML page and the PDFs
# pip install beautifulsoup4 lxml -> parse the HTML and find every href ending in ".pdf"
# Note: PyPDF2 3.x removed PdfFileReader, so this uses the newer PdfReader API.
import io

import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader

page_url = 'https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items'
response = requests.get(page_url)
soup = BeautifulSoup(response.content, "lxml")

for a in soup.find_all('a', href=True):
    urlpdf = a['href']
    if urlpdf.endswith('.pdf'):
        print("url ending in .pdf:", urlpdf)
        pdf_response = requests.get(urlpdf)
        with io.BytesIO(pdf_response.content) as f:
            pdf = PdfReader(f)
            information = pdf.metadata  # may be None if the PDF has no metadata
            number_of_pages = len(pdf.pages)
            if information is not None:
                txt = f"""
                Author: {information.author}
                Creator: {information.creator}
                Producer: {information.producer}
                Subject: {information.subject}
                Title: {information.title}
                """
                # Here is the metadata of your PDF
                print(txt)
            print("Number of pages:", number_of_pages)
            # numpage is a zero-based page index, so 11 is the 12th page of the PDF
            numpage = 11
            if numpage < number_of_pages:
                page = pdf.pages[numpage]
                page_content = page.extract_text()
                # print the text content of that page
                print(page_content)
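
To get from the extracted page text to separate files for ending stocks, exports, and so on, one option is to search the text for a row label and collect the numbers that follow it. The snippet below is only a sketch under assumptions: `extract_row` and `append_row` are hypothetical helper names, the sample text is made up for illustration (it is not taken from the report), and it assumes each table row ends up on its own line, which PyPDF2's text extraction does not guarantee.

```python
import csv
import re

# Matches numbers such as 12, 1,234.5 or -3.0 (illustrative pattern).
NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")


def extract_row(page_text, label):
    """Return the numbers on the first line starting with `label`, or None."""
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(label.lower()):
            rest = stripped[len(label):]
            return [float(m.replace(",", "")) for m in NUMBER.findall(rest)]
    return None


def append_row(csv_path, report_name, values):
    """Append one report's values to the CSV file for a given row label."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([report_name, *values])


# Made-up sample text standing in for page_content (NOT real report data).
sample_text = """
Exports            1.0   2.0   3.0
Ending Stocks      4.5   1,234.5   6.0
"""

ending = extract_row(sample_text, "Ending Stocks")
exports = extract_row(sample_text, "Exports")

# Example usage (hypothetical file and report names):
# append_row("ending_stocks.csv", "2019-11", ending)
# append_row("exports.csv", "2019-11", exports)
```

Inside the loop over reports you would pass `page_content` instead of `sample_text`. Row labels that carry footnote markers, or labels that occur in more than one table on the page, would need extra handling.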

score 1 · Answer 2 · answered Dec 01 '19 at 23:09

I would recommend Beautiful Soup if you need to scrape data from a website, but it looks like you are going to need OCR to extract the data from the PDF. There is a library called pytesseract for that. Look into it and its tutorials and you should be set.

score 0 · Answer 3 · answered Dec 02 '19 at 15:38

Try pdfreader. You can extract the tables as PDF markdown containing decoded text strings and then parse them as plain text.


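from pdfreader import SimplePDFViewer

# Open the downloaded report in binary mode; the file closes automatically
with open("latest.pdf", "rb") as fd:
    viewer = SimplePDFViewer(fd)
    viewer.navigate(12)  # go to page 12 (pdfreader counts pages from 1)
    viewer.render()
    markdown = viewer.canvas.text_content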

The `markdown` variable contains all the text, including PDF commands (positioning, display): every string appears in parentheses followed by a Tj or TJ operator. For more on PDF text operators, see PDF 1.7, sec. 9.4 Text Objects.

You can parse it with regular expressions, for example.
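
As a rough sketch of that regular-expression approach: the snippet below pulls the strings shown by `Tj` and `TJ` operators out of a content-stream text. The helper names (`extract_strings`, `unescape`) and the sample stream are made up for illustration, and the patterns only cover simple literal strings, not hex strings, octal escapes, or nested parentheses.

```python
import re

# A PDF literal string: ( ... ) with backslash escapes allowed inside.
PDF_STRING = re.compile(r"\((?:\\.|[^\\()])*\)")
# A text-showing operation: either "(string) Tj" or "[ ... ] TJ".
SHOW_TEXT = re.compile(
    r"(\((?:\\.|[^\\()])*\))\s*Tj|\[((?:\\.|[^\]\\])*)\]\s*TJ"
)


def unescape(s):
    """Undo the \\(, \\) and \\\\ escapes of a PDF literal string."""
    return re.sub(r"\\([()\\])", r"\1", s)


def extract_strings(content):
    """Return the text shown by each Tj/TJ operator, in order."""
    out = []
    for m in SHOW_TEXT.finditer(content):
        if m.group(1) is not None:
            # "(string) Tj": strip the surrounding parentheses.
            out.append(unescape(m.group(1)[1:-1]))
        else:
            # "[...] TJ": join the string pieces, ignoring the kerning numbers.
            parts = PDF_STRING.findall(m.group(2))
            out.append("".join(unescape(p[1:-1]) for p in parts))
    return out


# Made-up content stream for illustration (not taken from the report).
sample = "BT /F1 9 Tf 72 700 Td (Ending Stocks) Tj 0 -12 Td [(Ex) -20 (ports)] TJ ET"
strings = extract_strings(sample)
print(strings)
```

With the real page you would call `extract_strings(markdown)` and then group the resulting strings into table rows, for instance by matching the row labels you need.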
