Scraping PDF data from a website: the PDF formatting changed, and my tabula solution that worked for every other PDF now returns an empty DataFrame. Is there an alternative method?
Hello everyone,
I am trying to pull a PDF from the following website (in the blanks above, specify Registration Number: 08-0714, Reporting Period Year: 2023, Reporting Period Month: 03) and convert the deliveries data (pages 3, 6, and 9) into a pandas DataFrame, but I keep getting an empty DataFrame. The code below worked for every other PDF in the same category. Has anyone come across this issue before, and do you have any ideas? Note that I do not get an error, just an empty list. Thank you for your help.
import pandas as pd
import tabula as tb

# Insert the PDF name here. This is the PDF linked in the question, reduced to just the
# delivery pages (the table says DELIVERIES at the top; pages 3, 6, 9 on the website).
tables = tb.read_pdf(
    'GrayOak_Deliveries_2023-03.pdf',
    pages="all",
    area=(0, 0, 600, 1000),        # (top, left, bottom, right) in points
    columns=[172, 300, 430, 600],  # x coordinates of the column boundaries
    guess=True,
    pandas_options={'header': None},
    stream=True,
)  # read_pdf returns a list of DataFrames, one per extracted table

# Stack the extracted tables into a single DataFrame with a fresh row index.
df = pd.concat(tables, ignore_index=True)
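One thing I can check on my side is whether the hard-coded area and columns values still fit the new page layout, since tabula only reads inside that rectangle. Below is a minimal sketch of that check. It assumes the pages are US Letter size (612 x 792 points), which I have not confirmed for the new PDF, and check_layout is a hypothetical helper of my own, not part of tabula.

```python
# Sanity-check tabula's `area` and `columns` values against a page size.
# `area` is (top, left, bottom, right) and `columns` holds x coordinates, in points.
# ASSUMPTION: the page sizes below are US Letter; the real report may use another size.

LETTER_PORTRAIT = (612.0, 792.0)   # (width, height) in points
LETTER_LANDSCAPE = (792.0, 612.0)


def check_layout(area, columns, page_size):
    """Hypothetical helper: return a list of warnings (empty if nothing looks off)."""
    top, left, bottom, right = area
    width, height = page_size
    warnings = []
    # Anything below `bottom` or right of `right` is never read by the extractor.
    if bottom < height:
        warnings.append(
            f"area stops at y={bottom}; the bottom {height - bottom:g} pt of the page are not read"
        )
    if right < width:
        warnings.append(
            f"area stops at x={right}; the right {width - right:g} pt of the page are not read"
        )
    # Column boundaries only make sense inside both the area and the page.
    for x in columns:
        if not left < x < right:
            warnings.append(f"column boundary x={x} lies outside the area")
        elif x >= width:
            warnings.append(f"column boundary x={x} lies beyond the page width {width:g}")
    return warnings


if __name__ == "__main__":
    area = (0, 0, 600, 1000)
    columns = [172, 300, 430, 600]
    for name, size in [("portrait", LETTER_PORTRAIT), ("landscape", LETTER_LANDSCAPE)]:
        print(name, check_layout(area, columns, size))
```

If the new PDF uses a different page size or orientation than the older ones, a warning here would at least tell me which of the two settings to revisit first.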
Please let me know if you have any follow-up questions or if I need to provide more information. Thank you.