Scraping PDF data from a website: they changed their PDF formatting, so the solution that worked for every other PDF no longer works. Unsure of an alternative method.

Hello everyone,

I am trying to pull a PDF from the following website (in the blanks above, specify Registration Number: 08-0714, Reporting Period Year: 2023, Reporting Period Month: 03) and convert the deliveries data (pages 3, 6, and 9) into a pandas dataframe, but I keep getting an empty dataframe. The code below worked for every other PDF in the same category. Has anyone come across this issue before, and does anyone have any ideas? Note that I do not get an error, just an empty list. Thank you for your help.

import pandas as pd
import tabula as tb

# Insert the PDF name here. This is the PDF linked in the question, trimmed to just the
# delivery pages (the table says DELIVERIES at the top; pages 3, 6, and 9 on the website).
tables = tb.read_pdf(
    'GrayOak_Deliveries_2023-03.pdf',
    pages="all",
    area=(0, 0, 600, 1000),            # top, left, bottom, right, in points
    columns=[172, 300, 430, 600],      # x-coordinates of the column boundaries
    guess=True,
    pandas_options={'header': None},   # do not treat the first row as a header
    stream=True,
)

# read_pdf returns a list of DataFrames (one per detected table); stack them into one.
df = pd.concat(tables, ignore_index=True)

Please let me know if you have any follow-up questions or if I need to provide more information. Thank you.

jare2620
  • I get a 35 row dataframe. `[35 rows x 4 columns]` – jqurious Jul 28 '23 at 18:24
  • @jqurious Yes, that is page 1 of the PDF, but I am looking for pages 3, 6, and 9. For some reason the text on the first page looks scanned in, whereas the pages I am looking for look like they were added in Microsoft Word or something. I am looking to scrape the company and facility names, not just the raw numbers on the summary page. I edited the PDF so that it contains only pages 3, 6, and 9, which is why I used `pages = "all"`. If you downloaded the entire PDF, that argument would change to `pages = [3, 6, 9]`. – jare2620 Jul 28 '23 at 19:06
  • Oh right. Yeah, `pages = [3, 6, 9]` returns an empty list - will take another look. – jqurious Jul 28 '23 at 19:25
  • Looks like only page 1 has text, all other pages are images. – jqurious Jul 28 '23 at 19:28
  • @jqurious That's what I was suspecting. I'll manually pull the data and hope that the way this PDF was structured isn't how they upload their PDFs moving forward. Thank you for your help. – jare2620 Jul 28 '23 at 20:07
  • `tb` is not defined. What is this? – D.L Jul 28 '23 at 20:13
  • @D.L Yes, I forgot to show my libraries. It's the tabula library. You need `import tabula as tb` and `import pandas as pd` for this block of code. – jare2620 Jul 28 '23 at 20:47
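
Following up on the finding in the comments that only page 1 has text: tabula reads a PDF's text layer, so image-only pages would be expected to yield no tables. A minimal sketch for confirming which pages lack a text layer, with an optional OCR fallback, is below. It assumes `pypdf` is installed for the check, and `pdf2image` (which needs Poppler) plus `pytesseract` (which needs Tesseract) for the OCR step. The function names and the `min_chars` threshold are hypothetical choices for illustration, not part of tabula or of the original code.

```python
from typing import List, Sequence


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extractable text of every page ('' when a page has none)."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def pages_without_text(page_texts: Sequence[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers with fewer than min_chars non-whitespace characters.

    min_chars is an arbitrary illustrative threshold: scanned pages often
    carry no text at all, or only a few stray characters.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len("".join(text.split())) < min_chars
    ]


def ocr_page(pdf_path: str, page_number: int, dpi: int = 300) -> str:
    """OCR one page (1-based) to plain text; table structure is NOT recovered."""
    # Assumption: pdf2image and pytesseract are installed, along with the
    # Poppler and Tesseract binaries they depend on.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])


# Example usage (left commented out because it needs the PDF file on disk):
# texts = extract_page_texts("GrayOak_Deliveries_2023-03.pdf")
# for page_number in pages_without_text(texts):
#     print(page_number, ocr_page("GrayOak_Deliveries_2023-03.pdf", page_number)[:200])
```

The check is kept separate from the PDF reading so it can be reused with any extractor. Note that OCR output is plain text, so splitting it back into company, facility, and volume columns would still need parsing that depends on the page layout.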

0 Answers