3

I am trying to create an automated Python script that goes to a webpage like this, finds the link at the bottom of the body text (anchor text "here"), and downloads the PDF that loads after clicking said download link. I am able to retrieve the HTML from the original and find the download link, but I don't know how to get the link to the PDF from there. Any help would be much appreciated. Here's what I have so far:

import urllib3
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open page and locate href for bill text
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = [] 
for link in soup.findAll('a', href=True, text=['HERE', 'here', 'Here']):
    links.append(link.get('href'))  
links2 = [x for x in links if x is not None]

# Open download link to get PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')
links = [] 
for link in soup.findAll('a'):
    links.append(link.get('href'))  
links2 = [x for x in links if x is not None]

At this point the list of links I get does not include the PDF that I am looking for. Is there any way to grab this without hardcoding the link to the PDF in the code (that would be counterintuitive to what I am trying to do here)? Thanks!

  • 1
    You could `requests.get(url, allow_redirects=True)` and see the redirection history with `r.history`. From there, It should be simple to retrieve your pdf. – Camilo Martinez M. Apr 20 '21 at 20:22
  • 2
    Looks like adding a `download=1` parameter to the "here" link gets it to download the PDF for you. `https://www.murphy.senate.gov/download/green-bank-act-2021?download=1`. You will probably have to allow redirects for this to work. – BallpointBen Apr 20 '21 at 20:22
  • 1
    The link that you retrieve with your script is the link that the page links to via href. So it is an issue about the missing redirect (see the other comments). – Oskar Hofmann Apr 20 '21 at 20:30

1 Answers1

2

Looks for the a element with the text here then follows the trail.

import requests
from bs4 import BeautifulSoup

url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'

user_agent = {'User-agent': 'Mozilla/5.0'}

s = requests.Session()

r = s.get(url, headers=user_agent)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.select('a'):
    if a.text == 'here':
        href = a['href']
        r = s.get(href, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        _, dl_url = r.headers['refresh'].split('url=', 1)
        r = s.get(dl_url, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        file_bytes = r.content # here's your PDF; you can write it out to a file