I am trying to create an automated Python script that goes to a webpage like this, finds the link at the bottom of the body text (anchor text "here"), and downloads the PDF that loads after clicking that download link. I am able to retrieve the HTML of the original page and find the download link, but I don't know how to get from there to the PDF's URL. Any help would be much appreciated. Here's what I have so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Open page and locate href for bill text
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a', href=True, string=['HERE', 'here', 'Here']):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
# Open download link to get PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
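In case it helps to see what I'm after: here is a minimal sketch of the kind of second step I had in mind, assuming the PDF's `<a href>` is present in the page's static HTML, may be a relative path, and ends in `.pdf` (the sample HTML and base URL below are made up for illustration):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML of the intermediate download page.
sample_html = '''
<html><body>
<a href="/download/doc/green-bank-act.pdf">Download bill text</a>
<a href="https://example.com/about">About</a>
<a>anchor with no href</a>
</body></html>
'''

# Hypothetical base URL of the page the HTML came from,
# used to resolve relative hrefs into absolute ones.
base_url = 'https://example.gov/bills/'

soup = BeautifulSoup(sample_html, 'html.parser')
pdf_links = [
    urljoin(base_url, a['href'])            # resolve relative hrefs against the page URL
    for a in soup.find_all('a', href=True)  # skip anchors without an href
    if a['href'].lower().endswith('.pdf')   # keep only links that point at a PDF
]
print(pdf_links)  # → ['https://example.gov/download/doc/green-bank-act.pdf']
```

This works on the toy HTML above, but on the real page the list comes back without the PDF I want, which is what makes me think the PDF URL isn't in the static HTML at all.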
At this point the list of links I get does not include the PDF that I am looking for. Is there any way to grab it without hardcoding the PDF's URL in the code (that would defeat the purpose of what I am trying to do here)? Thanks!