python retrieve text from multiple random wikipedia pages

Question

I am using Python 2.7 with the wikipedia package to retrieve the text of multiple random Wikipedia pages, as explained in the docs.

I use the following code

import wikipedia

def get_random_pages_summary(pages=0):
    # Pick `pages` random article titles, then fetch each summary.
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

text = get_random_pages_summary(50)

and get the following error

File "/home/user/.local/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Priuralsky" may refer to:
Priuralsky District
Priuralsky (rural locality)

What I am trying to do is get the text from random Wikipedia pages, and I need it to be plain text, without any markup.

I assume the problem is that a random title sometimes matches more than one Wikipedia page. When I use the code to get a single Wikipedia page, it works well.

Thanks

Banana · Answer 1 · 2017-05-15T08:05:57.150

Since you are fetching random articles through a Wikipedia API (rather than pulling the HTML directly with other tools), my suggestion would be to catch the DisambiguationError and pick a new random article whenever it happens.

import wikipedia

def random_page():
    title = wikipedia.random(1)
    try:
        result = wikipedia.page(title).summary
    except wikipedia.exceptions.DisambiguationError:
        # Ambiguous title: draw another random article and try again.
        result = random_page()
    return result
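
The recursive retry above has no upper bound on the number of attempts. As a hedged alternative, here is a minimal sketch of the same idea written as a bounded loop. The names `random_summary` and `max_attempts` are hypothetical, and the Wikipedia calls are passed in as arguments (an assumption made so the sketch can be tried without network access).

```python
def random_summary(random_title, fetch_summary, ambiguous_error, max_attempts=10):
    """Return (title, summary) for the first random title that is not ambiguous.

    random_title: callable() -> str, returns one random page title.
    fetch_summary: callable(title) -> str, raises `ambiguous_error`
        when the title points to a disambiguation page.
    max_attempts: hypothetical safety limit on the number of retries.
    """
    for _ in range(max_attempts):
        title = random_title()
        try:
            return title, fetch_summary(title)
        except ambiguous_error:
            # Ambiguous title: draw another random title and try again.
            continue
    raise RuntimeError("no unambiguous page found after %d attempts" % max_attempts)
```

With the wikipedia package installed, this could be called as `random_summary(wikipedia.random, lambda t: wikipedia.page(t).summary, wikipedia.exceptions.DisambiguationError)`.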

tell k · Accepted Answer · 2017-05-15T14:31:51.940


According to the documentation (http://wikipedia.readthedocs.io/en/latest/quickstart.html), the DisambiguationError carries the candidate page titles in its options attribute, so you need to look up each of those candidates in turn.

import wikipedia

try:
    wikipedia.summary("Priuralsky")
except wikipedia.exceptions.DisambiguationError as e:
    # e.options lists the candidate page titles.
    for page_name in e.options:
        print(page_name)
        print(wikipedia.page(page_name).summary)

You can improve your code like this.

import wikipedia

def get_page_summaries(page_name):
    # Return a list of [title, summary] pairs for one title.
    try:
        return [[page_name, wikipedia.page(page_name).summary]]
    except wikipedia.exceptions.DisambiguationError as e:
        # Ambiguous title: fetch every candidate instead.
        # Note: a candidate can itself be ambiguous, in which case this raises again.
        return [[p, wikipedia.page(p).summary] for p in e.options]

def get_random_pages_summary(pages=0):
    ret = []
    page_names = [wikipedia.random(1) for i in range(pages)]
    for p in page_names:
        for page_summary in get_page_summaries(p):
            ret.append(page_summary)
    return ret

text = get_random_pages_summary(50)

edited May 15 '17 at 14:31

answered May 15 '17 at 07:48


I am still getting an error: line 393, in __load raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to) wikipedia.exceptions.DisambiguationError: "Churn" may refer to: Butter churn Churning (butter) Milk churn Churn drill Chuck Churn River Churn Churn Creek Churn Creek Protected Area Devils Churn Churn railway station Churn (Shihad album) Churn (Seven Mary Three album) Churn (band) Product churning Churning (stock trade) Churn rate Churning (cipher) – thebeancounter May 15 '17 at 12:31
Found it! The problem with your code was that sometimes, when a title has more than one option, looking up one of those options yields more than one option as well. I solved it by running the get_page_summary function again over each option in e.options. – thebeancounter May 15 '17 at 12:56
I fixed my code. Anyway, it was good that you could solve it. – tell k May 15 '17 at 14:35
Thanks! I found something else now: when a title yields more than one option and each option is looked up in turn, the process becomes recursive and can again return several options, including ones we already had, so sometimes you get many repetitions of the same Wikipedia page. – thebeancounter May 17 '17 at 07:03
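
One possible way to handle both issues raised in the comments (options that are themselves ambiguous, and the same page showing up repeatedly) is to recurse over the options while remembering which titles were already visited. The following is only a sketch under assumptions: `collect_summaries` and its parameters are hypothetical names, and the fetch function and exception class are passed in rather than hard-coded, so the logic can be tried without network access.

```python
def collect_summaries(title, fetch_summary, ambiguous_error, seen=None):
    """Return [title, summary] pairs for `title`, expanding ambiguous titles.

    fetch_summary: callable(title) -> str, expected to raise `ambiguous_error`
        (carrying an `.options` list of candidate titles) for ambiguous titles.
    seen: set of titles already visited; it is shared across the recursive
        calls so that no title is fetched twice.
    """
    if seen is None:
        seen = set()
    if title in seen:
        # Already handled (or currently being expanded): skip to avoid repeats.
        return []
    seen.add(title)
    try:
        return [[title, fetch_summary(title)]]
    except ambiguous_error as e:
        results = []
        for option in e.options:
            # An option may be ambiguous too, so expand it recursively.
            results.extend(
                collect_summaries(option, fetch_summary, ambiguous_error, seen)
            )
        return results
```

With the wikipedia package installed, a call could look like `collect_summaries("Churn", lambda t: wikipedia.page(t).summary, wikipedia.exceptions.DisambiguationError)`. Passing the same `seen` set to several calls would also avoid duplicates across different random titles.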
