
I am using tabula-py to download and extract tables from PDFs via a list of URLs. The URLs are created based on rules and everything works fine except when Tabula tries to process a PDF from a link with no page/file behind it (specifically weekends, as PDFs aren't published on weekends).

Full Python script below.

I want the script to skip any errors it runs into (specifically when attempting to pull from a weekend-based URL) and continue processing.

Any ideas?

import datetime
import pickle

import pandas
import tabula

# create text file

df=open('urls.txt','w')



# Example list

start = datetime.datetime(2022, 11, 1)
end = datetime.datetime(2022, 11, 11)
delta = datetime.timedelta(days=1)

pdf_path='https://www.irishprisons.ie/wp-content/uploads/documents_pdf/{date1:%d-%B-%Y}.pdf'

while start < end:
    date1 = start
    date2 = start + delta
    url = pdf_path.format(date1=date1, date2=date2)


# Save list and stop loop
    df.write(url)
    start = date2  

# Extract table from PDF available at url

    path = url
    # Make the most recent
    #path = "https://www.irishprisons.ie/wp-content/uploads/documents_pdf/11-November-2022.pdf"

    dfs = tabula.read_pdf(path, pages='1', lattice=True, stream=True, pandas_options={'header':None})


    try:
        new_header = dfs[0].iloc[1]
        inmate_count = dfs[0].drop(labels=0, axis=0)
        inmate_count.columns = [new_header]
        inmate_count=inmate_count.dropna(how='all').reset_index(drop=True)
        inmate_count = inmate_count.drop(labels=[0], axis=0)
        inmate_count['url'] = path
        inmate_count.to_csv("first_table.csv", mode='a', header=False, index=False)
        print(inmate_count)
    except Exception:
        pass

print("Finished")

I've tried try/except, but I'm unfamiliar with it and it doesn't seem to do anything.

Mark Rotteveel
  • If this code does not do what you want, then show us the output, and explain how that is different from what you wanted. – John Gordon Nov 13 '22 at 14:58
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Nov 13 '22 at 15:00
  • If you want to skip errors that are related to fetching the url, then it seems like the call to `read_pdf()` belongs inside the try/except block... – John Gordon Nov 13 '22 at 15:01

1 Answer


You can write a separate try/except for each independent piece of work so that when one fails, the others still run:

try:
  foo = func1()
  foo.func2()
except Exception:
  print("this failed")

try:
  mom = func3()
except Exception:
  print("this failed")

try:
  func4()
except Exception:
  print("this failed")
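Applied to the script in the question, the key change is to move the call that actually fails (`tabula.read_pdf`) inside the `try` block, so a missing weekend PDF is logged and skipped instead of crashing the loop. Below is a minimal sketch of that pattern; `fetch_table` is a hypothetical stand-in for the `tabula.read_pdf` call, simulating the failure that occurs on weekend URLs:

```python
import datetime

def fetch_table(url, date):
    # Hypothetical stand-in for tabula.read_pdf: no PDF is published on
    # weekends, so simulate the error a missing file would raise.
    if date.weekday() >= 5:  # Saturday=5, Sunday=6
        raise FileNotFoundError(f"no PDF published at {url}")
    return f"table for {date:%d-%B-%Y}"

start = datetime.datetime(2022, 11, 1)
end = datetime.datetime(2022, 11, 11)
delta = datetime.timedelta(days=1)
pdf_path = 'https://www.irishprisons.ie/wp-content/uploads/documents_pdf/{date1:%d-%B-%Y}.pdf'

tables = []
while start < end:
    url = pdf_path.format(date1=start)
    try:
        # The call that can fail goes INSIDE the try block.
        tables.append(fetch_table(url, start))
    except Exception as exc:
        # Log and continue with the next date instead of stopping.
        print(f"skipping {url}: {exc}")
    start += delta

print(f"extracted {len(tables)} tables")
```

In the real script, the same structure applies: wrap `dfs = tabula.read_pdf(...)` and the DataFrame processing together in one `try`, and `except` will then catch the fetch failure as well as any parsing errors.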
Glaucon