How to automate scraping the Wikipedia infobox specifically and print the data using Python for any wiki page?

Question

My task is to automate printing Wikipedia infobox data. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek); I want to extract the infobox section from the right-hand side and print it row by row on screen using Python. I specifically want the infobox. So far I have done this:

from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage =  'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)

This gives me everything from the info box. A snippet is shown below:

[<tr><th class="summary" colspan="2" style="text-align:center;font- 
size:125%;font-weight:bold;font-style: italic; background: lavender;"> 
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star 
Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59"

I want to extract only the data and print it on screen. What I want is:

Created by  Gene Roddenberry
Original work   Star Trek: The Original Series
Print publications
Book(s) 
List of reference books
List of technical manuals
Novel(s)    List of novels
Comics  List of comics
Magazine(s) 
Star Trek: The Magazine
Star Trek Magazine

And so on until the end of the infobox. So basically, is there a way of printing every row of the infobox data so that I can automate it for any wiki page? (The class of the infobox table on all wiki pages is 'infobox vevent', as shown in the code.)

What is wrong with parsing the content of the info box as well? — Nearoo, Oct 21 '18 at 09:50

score 0 · Answer 1 · answered Oct 21 '18 at 10:01

This page should help you parse your HTML as a simple string without the HTML tags: "Using BeautifulSoup Extract Text without Tags".

This is code from that page; it belongs to @0605002.

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN

score 0 · Answer 2 · answered Oct 22 '18 at 04:01

With BeautifulSoup, you need to reformat the data the way you want. Use fresult = [e.text for e in results] to get the text of each row.
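As a rough sketch of that idea, the following pairs the header cell and data cell of each row and prints them side by side. The function name infobox_rows and the sample HTML are made up for illustration. It also assumes the infobox table always carries the class 'infobox' (possibly next to others such as 'vevent'), which may not hold for every page.

```python
from bs4 import BeautifulSoup

def infobox_rows(html):
    """Return (label, value) pairs for the first infobox table in html."""
    soup = BeautifulSoup(html, 'html.parser')
    # class_='infobox' matches any table whose class list contains 'infobox',
    # so 'infobox vevent' is matched as well (assumption: the class is present)
    table = soup.find('table', class_='infobox')
    if table is None:
        return []
    rows = []
    for tr in table.find_all('tr'):
        header = tr.find('th')
        cell = tr.find('td')
        # get_text drops the tags; strip=True trims whitespace around each piece
        label = header.get_text(' ', strip=True) if header else ''
        value = cell.get_text(' ', strip=True) if cell else ''
        if label or value:  # skip rows without text, e.g. image-only rows
            rows.append((label, value))
    return rows

# Hypothetical, cut-down infobox used only to demonstrate the function
sample = """
<table class="infobox vevent">
  <tr><th colspan="2"><i>Star Trek</i></th></tr>
  <tr><th>Created by</th><td><a href="/wiki/Gene_Roddenberry">Gene Roddenberry</a></td></tr>
  <tr><td colspan="2"><img alt="logo"></td></tr>
</table>
"""

for label, value in infobox_rows(sample):
    print(label, value, sep='\t')
```

To run it against a live page, pass in the HTML fetched with urllib.request.urlopen(urlpage).read() as in the question. Cells that contain lists come out as one space-separated string, so further splitting may be needed to get one item per line.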

If you want to read an HTML table, you can try code like this, though it uses pandas.

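import pandas

urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# read_html returns one DataFrame per <table>; [0] takes the first table found
data = pandas.read_html(urlpage)[0]
null = data.isnull()

# print the first two cells of every row, leaving empty cells blank
for x in range(len(data)):
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")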
