Get the content of a PDF from a URL with Python 3 without downloading it

Question

Is there a way to read the first page of a PDF document from a URL without saving it locally? I need to read a PDF document requested from a website. Below is the code I tried to run. It works well with some HTTP URLs but not with others.

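import io

import PyPDF2
import urllib3

urllib3.disable_warnings()

url = "http://www.ain.gouv.fr/IMG/pdf/aprejetdae20210709enligne.pdf"

with urllib3.PoolManager() as http:
    r = http.request('GET', url)
    # Wrap the downloaded bytes in a file-like object; nothing is written to disk
    with io.BytesIO(r.data) as f:
        reader = PyPDF2.PdfFileReader(f)
        contents = reader.getPage(0).extractText().split('\n')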

Here is the output when I run this code with the following url: "http://www.ain.gouv.fr/IMG/pdf/aprejetdae20210709enligne.pdf"

['', '', '', '', '˘ˇˆ', '˙˝', '˚', '!˛', '˛ ', '', 'ˆ˙ˆ#$%', '$', "#'˙", '( ', '', '', '', '˘ˇˆˇ˙', '˝˘ˇˆˇ˛˚', '˜', ' !"ˇ#ˆ"!$%!"ˇ#&', "ˇ'", '˜', '(', '!"ˇ#ˆˇ!$%!"ˇ#)&*', '˜', '((ˇˇ%!"!ˇ', '+,+-', '(./', '01(', '!,(2$˙', '""˚345', '6', '7((&(1(8', '1ˆ(1((˛.', '˜', '$!"!ˇ(1*(1', '1', ',1˝/9,', '/1(', '˜', '\'%!"!ˇ(1(1', '1,1˝/9,(6', '˜', ')%:(()', '˜', '+,+-(.$!"!ˇ()', '˜', '(!˙%!"!ˇ()5,5,((', '=( ...

Python version : Python 3.10.0

K J · Answer 1 · 2023-03-12T16:34:19.073

Short answer: no, not normally. Longer answer: maybe, but only in a controlled setting.

For your question there are three types of PDF, in order of how common they are: non-linearized, linearized, and custom streamed. Custom streaming requires paid libraries at both ends, so let's rule that out.

When you download the start of a web-linearized PDF, you will see the first page quickly, but you can't interrogate that page easily unless you save the download as Zer0page.pdf.

To let any viewer interrogate pages in order, you need to download the full object dictionary, which is often at the end of the fully downloaded PDF.

Your example link is to the most common type, so the address of "Page 0" is stored at the end of the file, which requires a full download. See here, and note the scrollbar position on the right: this is the PDF as seen by any editor, such as a Python extractor. All the important data for reading and extraction is at the end of the downloaded file (whether it is held in memory or not). As you can see, objects can be in any order: here 12 comes before 10, and 45 (the root of the file) comes after 11. The first page (in your example, highlighted as 1 0 obj) could therefore have any number and could easily be (and sometimes is) the last object to download. Normally you don't see a first page until the progress bar has reached the end.
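The point about the lookup data sitting at the end of the file can be checked on bytes that are already in memory. Below is a minimal sketch, not a definitive implementation. The helper name `describe_pdf_layout` is hypothetical. It assumes the usual PDF layout, where the file ends with a `startxref` keyword followed by a byte offset, and where a linearized file declares a `/Linearized` entry within its first 1024 bytes.

```python
import re


def describe_pdf_layout(data: bytes) -> dict:
    """Hypothetical helper: report where a PDF's lookup structures sit."""
    head = data[:1024]
    tail = data[-1024:]
    # A linearized ("fast web view") file declares /Linearized near the start.
    linearized = b"/Linearized" in head
    # The last "startxref" keyword near the end of the file is followed by
    # the byte offset of the cross-reference data.
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    xref_offset = int(offsets[-1]) if offsets else None
    return {
        "linearized": linearized,
        "xref_offset": xref_offset,
        "size": len(data),
    }
```

Calling it on the downloaded bytes (for example `describe_pdf_layout(r.data)` with the question's code) shows whether the file is linearized and where the cross-reference offset points. For a non-linearized file that offset is only known once the tail of the file has arrived.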

score 0 · Answer 2 · answered Mar 12 '23 at 15:57

LlamaIndex (GPT Index) has a method that uses PyPDF2 to read the pages of a PDF from a file. A similar method can be written that passes bytes to PyPDF2 instead of reading from a file, and then reads just the first page in the for loop.

Link to the method: https://github.com/jerryjliu/gpt_index/blob/c9ee3eb18226c985884f0b1e452207a1c8669b5a/gpt_index/readers/file/docs_parser.py#L12

Modified method:

import io

import PyPDF2
import requests

# Download the PDF into memory instead of saving it to disk
response = requests.get("http://www.ain.gouv.fr/IMG/pdf/aprejetdae20210709enligne.pdf")
pdf_io_bytes = io.BytesIO(response.content)
text_list = []
pdf = PyPDF2.PdfReader(pdf_io_bytes)

num_pages = len(pdf.pages)

# Extract the text of every page; text_list[0] holds the first page
for page in range(num_pages):
    page_text = pdf.pages[page].extract_text()
    text_list.append(page_text)
text = "\n".join(text_list)

The first page returns the text:

"Direction des collectivités\net de l’appui territorial\nBureau de l'aménagement, de l'urbanismeet des installations classéesRéférences : FDS \nArrêté préfectoral portant rejet de la demande d’autorisation environnementale\nd’exploiter une installation de production d'électricité utilisant l'énergie mécanique du vent \npar la société SAS Parc éolien d’Echallon sur la commune d’Echallon\nLa Préfète de l'Ain,\nChevalier de la légion d’honneur,\nVUle code de l’environnement et notamment son titre VIII  - livre I, et en particulier ses articles L.181-9\net  R.181-34 ;\nVUl’ordonnance n° 2017-80 du 26 janvier 2017 relative à l’autorisation environnementale, et notamment\nson article 15 ;\nVUle décret n° 2017-81 du 26 janvier 2017 relatif à l'autorisation environnementale  ;\nVU la demande d’autorisation environnementale présentée en date du 11 janvier 2021 par la SAS Parc\néolien d’Echallon dont le siège social est situé 2 rue André Bonin, 69  004 LYON en vue d’obtenir\nl’autorisation d’exploiter une installation de production d’électricité à partir de l’énergie mécanique du\nvent et regroupant 8 aérogénérateurs et 3 postes de livraison sur la commune d’Echallon  ;\nVUle rapport du 6 avril 2021 de la direction régionale de l’environnement, de l'aménagement et du\nlogement Auvergne-Rhône-Alpes , chargée de l’inspection de l’environnement  ;\nVUle rapport du 5 juillet 2021 de la direction régionale de l’environnement, de l’aménagement et du\nlogement Auvergne-Rhône-Alpes établi suite au contradictoire  ;\nVUla notification au demandeur du projet d’arrêté préfectoral  ;\nVUle courrier de la SAS Parc éolien d’Echallon reçu le 6 mai 2021 en préfecture  ;\nVUla tenue de la réunion en date du 29 juin 2021 en sous-préfecture de NANTUA présidée par\nMadame la sous-préfète de Gex et Nantua  ;\nVUles observations présentées par le demandeur sur le projet d’arrêté lors de la réunion du 29 juin\n2021 ;\nCONSIDÉRANT la demande déposée le 11 janvier 2021  ;\nCONSIDÉRANT que le projet 
s’inscrit dans un secteur à haute valeur écologique (classement en ZNIEFF\nde type I et  2, zone de présence « de type II » du Grand Tétras, proximité immédiate d’un arrêté de\nprotection de biotope et de zones Natura 2000) ;\nCONSIDÉRANT que la préservation des milieux naturels concernés (hêtraie-sapinière de montagne) est\nnécessaire  au  maintien,  dans  un  état  de  conservation  favorable,  du  cortège  d’espèces  protégées\nassociées (notamment avifaune de montagne dont Chouette de Tengmalm, Chouette chevêchette, Grand\nTétras, rapaces dont Circaète Jean-le-Blanc, Aigle royal et Milan royal, chiroptères dont Minioptère de\nSchreibers) ;\n45 Avenue Alsace-LorraineQuartier Bourg Centre  - CS 80400 - 01012 BOURG EN BRESSE CEDEX Tél. 04.74.32.30.00 - Site internet : www.ain.gouv.fr"

PyPDF2 is deprecated. Use pypdf – Martin Thoma, Mar 21 '23 at 23:03
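Following the comment above, the same in-memory approach can be sketched with pypdf, reading only the first page. This is a minimal sketch under stated assumptions, not a definitive implementation. The helper names `fetch_bytes` and `first_page_text` are hypothetical, and pypdf is assumed to be installed. The whole file is still downloaded into memory; only saving it to disk is avoided.

```python
import io
import urllib.request


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Hypothetical helper: download the full response body into memory."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def first_page_text(pdf_bytes: bytes) -> str:
    """Hypothetical helper: extract the text of page 0 from in-memory PDF bytes."""
    # Imported here so that fetch_bytes can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return reader.pages[0].extract_text()
```

Usage would look like `first_page_text(fetch_bytes(url))`, with `url` set to the PDF address from the question.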
