I'm using BeautifulSoup to extract the revenue figure from the infobox on Wikipedia company pages. My code works when the revenue text is inside an 'a' tag. Unfortunately, not every page wraps its revenue in an 'a' tag; on some pages the text sits in or after a 'span' tag, for example. What is the best / safest way to get the revenue text for a list of companies? Would searching for a different tag in place of 'a' work best, or is there a better approach? Thanks for your help.
import re
import urllib.request

from bs4 import BeautifulSoup

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']
rev = re.compile('^Revenue')

for c in company:
    r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")
    # Find the infobox header cell whose text starts with "Revenue"
    thRev = [e for e in soup.find_all('th', {'scope': 'row'}) if rev.search(e.text)][0]
    tdRev = thRev.find_next('td')
    # This only finds the revenue when it is wrapped in an 'a' tag
    revenue = tdRev.find_all('a')
    for f in revenue:
        print(c + " " + f.text)
        break
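
One option that does not depend on any particular tag might be to read all of the text in the 'td' cell with `get_text()`. Below is a minimal sketch of that idea, not a tested solution for every page. It assumes the infobox looks roughly like the simplified `SAMPLE_HTML` shown, which is made-up markup with a placeholder figure, and `extract_revenue` is a hypothetical helper name. It uses the built-in `html.parser` so it runs without lxml and without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox markup for illustration only;
# real Wikipedia infoboxes have more rows and more nested tags.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th scope="row">Industry</th><td>Aerospace</td></tr>
  <tr><th scope="row">Revenue</th>
      <td><span class="nowrap">US$12.3 billion</span> (2015)<sup>[1]</sup></td></tr>
</table>
"""


def extract_revenue(html):
    """Return the text of the Revenue cell, or None if no such row exists."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th", {"scope": "row"}):
        if th.get_text(strip=True).startswith("Revenue"):
            td = th.find_next("td")
            if td is None:
                return None
            # Drop footnote markers such as [1] before reading the text.
            for sup in td.find_all("sup"):
                sup.decompose()
            # get_text() gathers text from every descendant, whatever the tag.
            return " ".join(td.get_text(" ", strip=True).split())
    return None


print(extract_revenue(SAMPLE_HTML))
```

Returning `None` when there is no Revenue row also avoids the `IndexError` that indexing an empty list with `[0]` would raise on pages without that row. Real cells can hold extra text (several years, change arrows), so the returned string may still need cleaning up.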