I want to grab the age, place of birth and previous occupation of senators. Information for each individual senator is available on Wikipedia, on their respective pages, and there is another page with a table that lists all senators by name. How can I go through that list, follow the links to each senator's page, and grab the information I want?
Here is what I've done so far.
1. (No Python) Found out that DBpedia exists and wrote a query to search for senators. Unfortunately, DBpedia hasn't categorized most (if any) of them:
SELECT ?senator ?country WHERE { ?senator rdf:type <http://dbpedia.org/ontology/Senator> . ?senator <http://dbpedia.org/ontology/nationality> ?country }
Query results are unsatisfactory.
2. Found out that there is a Python module called wikipedia that allows me to search and retrieve information from individual wiki pages. Used it to get a list of senator names from the table by looking at the hyperlinks.
import wikipedia as w
w.set_lang('pt')
# Grab page with table of senator names.
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
# Get links to senator names by removing links of no interest.
# Senator names contain neither digits nor commas,
# and full names always contain spaces.
senators = [
    name for name in s.links
    if ' ' in name
    and not any(char.isdigit() or char == ',' for char in name)
]
At this point I'm a bit lost. The list senators contains all senator names, but also other names, e.g., party names. The wikipedia module (at least from what I could find in the API documentation) also doesn't implement functionality to follow links or search through tables.
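One possible way around the missing table support might be to parse the page HTML directly and keep only the links that sit in one column of a table. The sketch below uses only the standard library. `TableLinkParser` and `table_links` are hypothetical names, the sample HTML and the names in it are made up, and it assumes the real HTML comes from something like `s.html()` and that the senator names sit in the first cell of each row (rowspans or nested tables would break the column count).

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect (title, href) for links found in one column of any table row."""

    def __init__(self, column=0):
        super().__init__()
        self.column = column    # zero-based index of the cell to read
        self.cell_index = -1    # index of the current cell within its row
        self.in_cell = False    # True while inside a <td> or <th>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = -1          # new row: restart cell counting
        elif tag in ("td", "th"):
            self.cell_index += 1
            self.in_cell = True
        elif tag == "a" and self.in_cell and self.cell_index == self.column:
            attrs = dict(attrs)
            if "href" in attrs:
                self.links.append((attrs.get("title", ""), attrs["href"]))

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False


def table_links(html, column=0):
    """Return the (title, href) pairs for links in the given table column."""
    parser = TableLinkParser(column)
    parser.feed(html)
    return parser.links


# Made-up sample standing in for the real page HTML (assumption).
sample = """
<p><a href="/wiki/Brasil" title="Brasil">Brasil</a></p>
<table>
<tr><th>Senador</th><th>Partido</th></tr>
<tr><td><a href="/wiki/Fulano_de_Tal" title="Fulano de Tal">Fulano de Tal</a></td>
    <td><a href="/wiki/PXX" title="PXX">PXX</a></td></tr>
</table>
"""
print(table_links(sample))
```

Each returned title could then be passed to `w.page(title)` to open the corresponding senator page, which would leave out party links and links outside the table.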
I've seen two related entries here on Stack Overflow that seem helpful, but both (here and here) only extract information from a single page.
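For the single-page step, a minimal sketch of what I imagine could be repeated for every senator is below. It assumes the raw wikitext of a page is available (the wikipedia module's `content` attribute is plain text, so the wikitext would have to come from elsewhere, e.g. the MediaWiki API) and that the infobox uses `| field = value` lines. `infobox_fields` is a hypothetical helper, and the field names and sample values are assumptions for illustration, not checked against the real pages.

```python
import re


def infobox_fields(wikitext, wanted):
    """Return {field: value} for '| field = value' lines whose field is wanted."""
    fields = {}
    for line in wikitext.splitlines():
        # Match lines such as "| nascimento_local = Cidade Exemplo".
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)$", line)
        if match and match.group(1) in wanted:
            fields[match.group(1)] = match.group(2).strip()
    return fields


# Made-up wikitext; the field names are assumptions about the infobox.
sample_wikitext = """{{Info/Político
| nome = Fulano de Tal
| nascimento_data = 1 de janeiro de 1950
| nascimento_local = Cidade Exemplo
| profissão = advogado
}}"""

wanted = {"nascimento_data", "nascimento_local", "profissão"}
print(infobox_fields(sample_wikitext, wanted))
```

Values that span several lines or contain nested templates would need a real wikitext parser rather than a line-based regex.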
Can anyone point me towards a solution?
Thanks!