
I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but I can't get the subheading. Using Chrome's Inspect Element, I can see the subheading text "Canada Census, 1901" embedded as follows:

<div class="person-info">
    <div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
    <div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>

So I coded my script as follows:

import urllib2
from bs4 import BeautifulSoup

def get_FamSearch():

    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    # Name an explicit parser so results don't depend on what's installed
    Soup_FamSearch = BeautifulSoup(openLink, "html.parser")
    openLink.close()

    # Name: pulled from the results table
    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode

    # Subheading: this find() never matches anything
    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)

get_FamSearch()

These are the results; the script fails to locate and extract the subheading:

Helen Brad
[Finished in 2.2s]
KubiK888

1 Answer


The page you are getting via urllib2 doesn't contain the div with the subhead class. That heading is constructed asynchronously by JavaScript executed on the browser side, so it never appears in the raw HTML your script downloads.
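To illustrate, here is a minimal sketch; the HTML string is an invented stand-in for the raw server response, which contains the title but not the JavaScript-inserted subhead:

```python
from bs4 import BeautifulSoup

# What the server actually returns: the "subhead" div is absent because
# it is inserted later by client-side JavaScript.
server_html = """
<div class="person-info">
    <div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
</div>
"""

soup = BeautifulSoup(server_html, "html.parser")
subhead = soup.find("div", class_="subhead ng-scope ng-binding")
print(subhead)  # None - the element simply isn't in the downloaded HTML
```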

The data you need is presented differently, here's what works for me:

print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()

Prints:

Canada Census, 1901
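The one-liner above can be sketched self-contained; the `<dl>` fragment below is a guessed stand-in for the record page's definition list, not the real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the record details are a <dl> of <dt>/<dd>
# pairs, with the collection title filed under a "Title" label.
html = """
<dl>
    <dt>Name</dt><dd>Helen Brad</dd>
    <dt>Title</dt><dd>Canada Census, 1901</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the <dt> whose text is exactly "Title", then read its <dd> sibling.
title = soup.find("dt", text="Title").find_next_sibling("dd").text.strip()
print(title)  # Canada Census, 1901
```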
alecxe
  • Hi alecxe, thanks. Your line of code works fine on a valid page, but I would like to scan through many pages, many of which have no subheading (or invalid links) or a different subheading. I want to assign the result to a variable: x = Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip(), and then use something like "if x:" in a loop to output only the links that have such a subheading. But I get this error: "AttributeError: 'NoneType' object has no attribute 'find_next_sibling'". It looks like BeautifulSoup tries to find the element but fails. How should I solve this? – KubiK888 Sep 02 '14 at 19:41
  • @KubiK888 you can follow the approach you've been using so far: assign a variable to `Soup_FamSearch.find('dt', text='Title')` and check if it is not `None` before getting the `find_next_sibling()`. – alecxe Sep 02 '14 at 19:42
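A minimal sketch of that guard, assuming the same hypothetical `<dt>`/`<dd>` layout (the sample HTML fragments are invented for illustration):

```python
from bs4 import BeautifulSoup

def get_subheading(soup):
    # Return the subheading text, or None when the page has no "Title" entry.
    title_dt = soup.find("dt", text="Title")
    if title_dt is None:
        return None
    return title_dt.find_next_sibling("dd").text.strip()

with_title = BeautifulSoup(
    "<dl><dt>Title</dt><dd>Canada Census, 1901</dd></dl>", "html.parser")
without_title = BeautifulSoup(
    "<dl><dt>Name</dt><dd>Helen Brad</dd></dl>", "html.parser")

print(get_subheading(with_title))     # Canada Census, 1901
print(get_subheading(without_title))  # None
```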