How to identify and follow a link, then print data from a new webpage with BeautifulSoup

Question

I am trying to (1) grab a title from a webpage, (2) print the title, (3) follow a link to the next page, (4) grab the title from the next page, and (5) print the title from the next page.

Steps (1) and (4) use the same function, as do steps (2) and (5). The only difference is that (4) and (5) are performed on the next page.

#Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


##Internet
#Link to webpage 
web_page = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
#Soup object
soup = BeautifulSoup(web_page, 'html.parser')

I am not having any problems with steps 1 and 2. My code is able to get the title and print it effectively. Steps 1 and 2:

##Get Data
def get_title():
    #Patent Number
    Patent_Number = soup.title.text
    print(Patent_Number)

get_title()

The output I am getting is exactly what I want:

#Print Out
United States Patent: 10530579

I am having trouble with step 3. I have been able to identify the right link, but not follow it to the next page. The link I want is the 'href' of the tag directly above (the parent of) the image tag.

[Image: picture of the link to follow]

The following code is my working draft for steps 3,4, and 5:

#Get
def get_link():
    ##Internet
    #Link to webpage 
    html = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
    #Soup object
    soup = BeautifulSoup(html, 'html.parser')
    #Find image
    ##image = <img valign="MIDDLE" src="/netaicon/PTO/nextdoc.gif" border="0" alt="[NEXT_DOC]">
    #image = soup.find("img", valign="MIDDLE")
    image = soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]")
    #Get new link (from the tag that wraps the image)
    link = image.parent
    new_link = link.attrs['href']
    print(new_link)

get_link()

The output I am getting:

#Print Out
##/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=32&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"

The output is the exact link I want to follow. In short, the function I am trying to write will open the new_link variable as a new webpage, and perform the same functions performed in (1) and (2) on the new webpage. The resulting output will be two titles instead of one (one for the webpage and one for the new webpage).

In essence, I need to write a:

urlopen(new_link)

function, instead of a:

print(new_link)

function. Then, perform steps 4 and 5 on the new webpage. However, I am having trouble figuring out how to open the new page and grab the title. One problem is that new_link is not a full URL, but a relative link I want to follow.

score 1 · Answer 1 · answered Feb 25 '20 at 10:22

Although you have already found a solution, here is an alternative in case someone else is trying something similar. It is not recommended for all cases. Here, though, the URLs of the pages differ only by the record number r, so we can generate them dynamically and request them in bulk, as below. Change the upper bound of the range as needed; it works as long as the page exists.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

head = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r="  # ends right before the record number
# quotes are percent-encoded as %22, as in the question's URL
trail = "&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22"

final_url = []
news_data = []
for r in range(32,38): #change the upper range as per requirement
    final_url.append(head + str(r) + trail)
for url in final_url:
    try:
        page = urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')   
        patentNumber = soup.title.text
        news_articles = [{'page_url':  url,
                     'patentNumber':  patentNumber}
                    ]
        news_data.extend(news_articles)     
    except Exception as e:
        print(e)
        print("continuing....")
        continue
df =  pd.DataFrame(news_data)
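
As a small variation on the same idea, the record number can also be bumped directly inside an existing URL instead of splitting it into head and trail by hand. This is only a sketch: next_record_url is a hypothetical helper name, the sample URL is abbreviated, and it assumes the record number is the only query parameter literally named r.

```python
import re

def next_record_url(url, step=1):
    """Return url with its r=<number> query parameter increased by step.

    Hypothetical helper; assumes exactly one parameter named r.
    """
    def bump(match):
        return match.group(1) + str(int(match.group(2)) + step)

    # [?&] ties the match to a parameter boundary, so "&co1=" or "Sect1=" is not hit
    new_url, count = re.subn(r"([?&]r=)(\d+)", bump, url, count=1)
    if count == 0:
        raise ValueError("no r=<number> parameter found in URL")
    return new_url

# Abbreviated sample URL (assumption: same shape as the one in the question)
url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&p=1&r=31&f=G&l=50"
print(next_record_url(url))
```

Like the loop above, this only helps when the pages really are numbered consecutively; otherwise following the link on the page is the safer route.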

BSH180_44 · Answer 2 · 2020-02-25T20:27:48.907

Rather than print(new_link), this function prints the title from the next page.

def get_link():
    ##Internet
    #Link to webpage 
    html = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
    #Soup object
    soup = BeautifulSoup(html, 'html.parser')
    #Find image
    image = soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]")
    #Follow link
    link = image.parent
    new_link = link.attrs['href']
    new_page = urlopen('http://patft.uspto.gov/'+new_link)
    soup = BeautifulSoup(new_page, 'html.parser')
    #Patent Number
    Patent_Number = soup.title.text
    print(Patent_Number)

get_link()

Prepending 'http://patft.uspto.gov/' to new_link turned the relative link into a valid URL. I could then open the URL, navigate to the page, and retrieve the title.
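
For anyone who would rather not hard-code the host, the standard library's urljoin can resolve the relative href against the URL of the page it was found on. A minimal sketch follows; both URLs are abbreviated versions of the ones above, and the variable names page_url and absolute_url are mine.

```python
from urllib.parse import urljoin

# URL of the page the link was found on (abbreviated for readability)
page_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&r=31&f=G"
# Relative href as returned by link.attrs['href'] (also abbreviated)
new_link = "/netacgi/nph-Parser?Sect1=PTO2&r=32&f=G"

# urljoin keeps the scheme and host of page_url and swaps in the new path and query
absolute_url = urljoin(page_url, new_link)
print(absolute_url)
```

The result can be passed to urlopen in the same way as the concatenated string, and urljoin also leaves an href that is already absolute unchanged.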

score 0 · Answer 3 · answered Feb 25 '20 at 10:20

You can use a regular expression to extract the scheme and host from the current URL and rebuild the link from them (in case the site changes). The whole sample code follows:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# The first link
url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22"

# Test loop (to grab 5 records)
for _ in range(5):
    web_page = urlopen(url)
    soup = BeautifulSoup(web_page, 'html.parser')

    # step 1 & 2 - grabbing and printing title from a webpage
    print(soup.title.text) 

    # step 3 - getting the link to the next page
    next_page_link = soup.find('img', {'alt':'[NEXT_DOC]'}).find_parent('a').get('href')

    # extracting the prefix (http or https) and the site (everything until the first /)
    match = re.compile("(?P<prefix>http(s)?://)(?P<site>[^/]+)(?:.+)").search(url)
    if not match:
        break  # url is not absolute, so the next link cannot be built

    prefix = match.group('prefix')
    site = match.group('site')

    # formatting the link to the next page
    url = '%s%s%s' % (prefix, site, next_page_link)

    # printing the link just for debug purpose
    print(url)

    # continuing with the loop
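
If you prefer not to maintain the regular expression, the same scheme and host can be read with urlsplit from the standard library. This is a sketch of that swap, not a change to the loop above: site_root is a hypothetical helper name, and the URLs are abbreviated.

```python
from urllib.parse import urlsplit

def site_root(url):
    """Return scheme://host for an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % url)
    return "%s://%s" % (parts.scheme, parts.netloc)

# Abbreviated stand-ins for url and next_page_link from the loop
url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&r=31&f=G"
next_page_link = "/netacgi/nph-Parser?Sect1=PTO2&r=32&f=G"

print(site_root(url) + next_page_link)
```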

score 0 · Accepted Answer · answered Feb 25 '20 at 10:24

Took the opportunity to clean up your code. I removed the unnecessary import of re and simplified your functions:

from urllib.request import urlopen
from bs4 import BeautifulSoup


def get_soup(web_page):
    web_page = urlopen(web_page)
    return BeautifulSoup(web_page, 'html.parser')

def get_title(soup):
    return soup.title.text  # Patent Number

def get_next_link(soup):
    return soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]").parent['href']

base_url = 'http://patft.uspto.gov'
web_page = base_url + '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22'

soup = get_soup(web_page)

get_title(soup)
> 'United States Patent: 10530579'

get_next_link(soup)
> '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=32&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"'

soup = get_soup(base_url + get_next_link(soup))
get_title(soup)
> 'United States Patent: 10529534'

get_next_link(soup)
> '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=33&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"'

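To repeat the get-title-then-follow step more than once, the calls above can be wrapped in a loop. The sketch below shows one possible shape. To keep it runnable offline, it takes the page-reading step as a parameter: read_page is a hypothetical callable that returns (title, next_href), with next_href set to None when there is no next link. The example.com URLs and fake_site are made up for the demonstration.

```python
from urllib.parse import urljoin

def follow_next_links(start_url, read_page, max_pages=5):
    """Yield (url, title) pairs, following 'next' links from start_url.

    read_page is a caller-supplied function (hypothetical interface) that
    takes a URL and returns (title, next_href); next_href is None when the
    page has no next link.
    """
    url = start_url
    for _ in range(max_pages):
        title, next_href = read_page(url)
        yield url, title
        if next_href is None:
            break
        # Resolve the relative href against the page it came from
        url = urljoin(url, next_href)

# Stand-in for real network access so the sketch runs offline
fake_site = {
    "http://example.com/doc?r=1": ("Doc 1", "/doc?r=2"),
    "http://example.com/doc?r=2": ("Doc 2", "/doc?r=3"),
    "http://example.com/doc?r=3": ("Doc 3", None),
}

for page_url, page_title in follow_next_links("http://example.com/doc?r=1", fake_site.__getitem__):
    print(page_title)
```

With the helpers above, read_page could be a small function that calls get_soup once and returns get_title(soup) together with get_next_link(soup). It would also need to return None for the link when the [NEXT_DOC] image is missing, which get_next_link as written does not handle.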