How to read a PDF inside a website without downloading the file

Question

I want to get the data from a PDF that is embedded in a website. I have tried tabula, but it gave me the following error:

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\Hector\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--guess', '--format', 'JSON', 'C:\\Users\\Hector\\AppData\\Local\\Temp\\3231632d-81cd-4914-b5e9-cc12f03b607e.pdf']' returned non-zero exit status 1.

from tabula import read_pdf

df = read_pdf("url")

score 0 · Answer 1 · answered Jan 14 '23 at 02:45

You can try io.BytesIO. For example:

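import io
import requests
import tabula

pdfUrl = 'https://www.premera.com/documents/052166_2023.pdf'
# Fetch the PDF into memory; a browser-like user agent avoids some blocks
r = requests.get(pdfUrl, headers={'user-agent': 'Mozilla/5.0'})
r.raise_for_status()
# read_pdf returns a list of DataFrames; take the first table on page 6
tabula.read_pdf(io.BytesIO(r.content), pages='6')[0]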

I'm not sure the page link will work as pdfUrl since the PDF is embedded in the page, so if tabula raises an error, try the PDF's direct download link instead.
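
To tell the two cases apart before tabula runs, one option is to check that the response really is a PDF. The following is only a sketch: looks_like_pdf and read_tables_from_url are hypothetical helper names (not part of tabula or requests), and the 1024-byte window and 30-second timeout are assumptions. It relies on PDF files starting with the marker "%PDF-".

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes carry a PDF header near the start."""
    # PDF files begin with "%PDF-"; scanning the first 1024 bytes
    # (an assumed window) tolerates a little leading junk.
    return b"%PDF-" in data[:1024]


def read_tables_from_url(url: str, pages="all"):
    """Fetch a PDF into memory and pass it to tabula (hypothetical helper)."""
    # Third-party imports are kept local so looks_like_pdf() works without them.
    import requests
    import tabula

    r = requests.get(url, headers={"user-agent": "Mozilla/5.0"}, timeout=30)
    r.raise_for_status()
    if not looks_like_pdf(r.content):
        # An embedded-viewer page usually returns HTML, not the PDF itself.
        content_type = r.headers.get("Content-Type")
        raise ValueError(f"{url} did not return a PDF (Content-Type: {content_type})")
    # Returns a list of DataFrames, one per detected table.
    return tabula.read_pdf(io.BytesIO(r.content), pages=pages)
```

With this, read_tables_from_url(pdfUrl, pages='6')[0] would behave like the snippet above, but a page link that serves HTML fails with a clear ValueError instead of a Java CalledProcessError.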
