What's the best way to extract specific text from Wikipedia's Infobox using BeautifulSoup and lists?

Question

I'm using BeautifulSoup to extract the revenue figure from Wikipedia infoboxes. My code works when the revenue text is inside an 'a' tag, but not every page is marked up that way. On some pages the revenue text follows a 'span' tag, for example. What is the best / safest way to get the revenue text for a list of companies? Would searching for another tag in place of 'a' work best, or is there a better approach? Thanks for your help.

import re
import urllib.request
from bs4 import BeautifulSoup

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")

    # Find the infobox header cell whose text starts with "Revenue"
    rev = re.compile('^Revenue')
    thRev = [e for e in soup.find_all('th', {'scope': 'row'}) if rev.search(e.text)][0]
    tdRev = thRev.find_next('td')
    # Only picks up revenue text that is wrapped in an 'a' tag
    revenue = tdRev.find_all('a')

    for f in revenue:
        print(c + " " + f.text)
        break

Yes! Sorry. https://en.wikipedia.org/wiki/Lockheed_Martin, https://en.wikipedia.org/wiki/Phillips_66 — SallyH, May 03 '16 at 22:31
On both of your examples, the revenue isn't inside an `a` tag. — Pedro Lobito, May 03 '16 at 22:34
doesn't wikipedia have an api? You should probably use that instead of scraping — joel goldstick, May 04 '16 at 00:01

Pedro Lobito · Accepted Answer · 2017-07-14T21:42:34.950

You can try:

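from bs4 import BeautifulSoup
import urllib.request
import re

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")
    for tr in soup.find_all('tr'):
        trText = tr.text
        # re.MULTILINE lets ^ and $ match a line holding only "Revenue" inside the row text
        if re.search(r"^\bRevenue\b$", trText, re.MULTILINE):
            # Currency prefix, "$", optional whitespace, a number, one character, a unit word
            match = re.search(r"\w+\$(?:\s+)?[\d\.]+.{1}\w+", trText)
            if match:  # skip rows where no amount is found
                print(c + "\n" + match.group() + "\n")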

Output:

Lockheed_Martin
US$ 46.132 billion
Phillips_66
US$ 161.21 billion
ConocoPhillips
US$55.52 billion
Sysco
US$44.41 Billion
Baker_Hughes
US$ 22.364 billion
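
To check the two patterns without hitting Wikipedia, the matching step can be pulled into a small function and run on sample strings. This is only a sketch (Python 3): `revenue_from_row_text` is a hypothetical helper name, the sample row texts are made up to resemble the output above, and `re.MULTILINE` on the label check is an assumption that the label sits on its own line inside the row text.

```python
import re

# The same two patterns used on the row text in the loop.
# re.MULTILINE (an assumption here) lets ^ and $ match around a "Revenue"
# label that sits on its own line inside a longer row text.
LABEL_RE = re.compile(r"^\bRevenue\b$", re.MULTILINE)
AMOUNT_RE = re.compile(r"\w+\$(?:\s+)?[\d\.]+.{1}\w+")


def revenue_from_row_text(row_text):
    """Return the first currency amount found in a 'Revenue' row, or None."""
    if not LABEL_RE.search(row_text):
        return None  # not the revenue row
    match = AMOUNT_RE.search(row_text)
    return match.group() if match else None


# Made-up row texts, shaped like tr.text for an infobox row
samples = [
    "\nRevenue\nUS$ 46.132 billion\n",
    "\nRevenue\nUS$55.52 billion\n",
    "\nNet income\nUS$ 3.6 billion\n",
]
for text in samples:
    print(revenue_from_row_text(text))
```

In the scraping loop this would be called as `revenue_from_row_text(tr.text)` for each row. Returning None instead of calling `.group()` directly avoids an AttributeError on rows where no amount is found.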

Note: You may want to use the Wikipedia API instead, e.g.:

https://en.wikipedia.org/w/api.php?action=query&titles=Baker_Hughes&prop=revisions&rvprop=content&format=json
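
A rough sketch of how that URL could be used from Python 3 follows. The helper names (`build_api_url`, `wikitext_from_response`, `revenue_field`, `fetch_revenue`) and the User-Agent string are hypothetical. It assumes the response keeps the page source under the "*" key of the latest revision, and that the infobox has a `revenue` parameter on a single line. The value comes back as raw wikitext, which may still contain template markup and need further cleanup.

```python
import json
import re
import urllib.parse
import urllib.request

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_api_url(title):
    """Build the query URL for one page title, with the parameters shown above."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def wikitext_from_response(data):
    """Pull the page source out of the decoded JSON.

    Assumes the page exists and that the text of the latest revision
    is stored under the "*" key.
    """
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]


def revenue_field(wikitext):
    """Return the raw value of the infobox 'revenue' parameter, or None."""
    match = re.search(r"^[ \t]*\|[ \t]*revenue[ \t]*=[ \t]*(.+)$",
                      wikitext, re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None


def fetch_revenue(title):
    """Network call: download the page source and extract the revenue field."""
    request = urllib.request.Request(
        build_api_url(title),
        # Placeholder User-Agent; replace it with one that identifies your script
        headers={"User-Agent": "infobox-revenue-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return revenue_field(wikitext_from_response(data))
```

Calling `fetch_revenue(c)` for each entry in `company` would replace the HTML parsing step. Keeping the parsing functions separate from the network call means they can be tried on saved responses first.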
