I'm trying to get the plain text of a section (no HTML/CSS, special characters, escape sequences such as \n, links, or images) using the Wikipedia API. I'm trying to do that with this code:

import requests

API_URL = 'http://en.wikipedia.org/w/api.php'

def get_section(page, section):
    search_params = {
        'action': 'parse',
        'prop': 'text',
        'pageid': page,
        'section': section,
        'format': 'json'
    }

    response = requests.get(API_URL, params=search_params)

    return response.json()

text = get_section(23862, 2)
print(text['parse']['text']['*'].strip())  # strip the string, not print()'s None return value

It raises this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 5722: character maps to <undefined>

I need to get article sections the same way the exintro parameter returns the article intro:

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&explaintext&pageids=23862

It returns plain text, which is exactly what I need.

Levon

1 Answer

I would suggest using Pywikibot for this. It includes a handy module, pywikibot/data/api.py, that you can use easily. Start here: https://www.mediawiki.org/wiki/Manual:Pywikibot/Create_your_own_script and then look into api.py to see which options are available for getting the results you want.
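
If you would rather stay with plain API calls, here is a minimal sketch (not the Pywikibot approach) that keeps the action=parse request from the question and strips the returned HTML with the standard library. It uses urllib instead of requests so it runs without extra packages. The names TextExtractor, html_to_text, console_safe and get_section_text, and the User-Agent string, are my own assumptions for illustration. The UnicodeEncodeError most likely comes from the console encoding rather than from the API, so console_safe replaces characters the console cannot show.

```python
import json
import sys
import urllib.parse
import urllib.request
from html.parser import HTMLParser

API_URL = 'https://en.wikipedia.org/w/api.php'


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <style>/<script>."""
    SKIP = {'style', 'script'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Drop tags, then collapse newlines and repeated whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(''.join(parser.parts).split())


def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    encoding = encoding or sys.stdout.encoding or 'utf-8'
    return text.encode(encoding, errors='replace').decode(encoding)


def get_section_text(page_id, section):
    """Fetch one section as HTML via action=parse and return plain text."""
    params = urllib.parse.urlencode({
        'action': 'parse', 'prop': 'text', 'pageid': page_id,
        'section': section, 'disableeditsection': 1, 'format': 'json',
    })
    request = urllib.request.Request(
        API_URL + '?' + params,
        # Hypothetical User-Agent; replace it with one describing your tool.
        headers={'User-Agent': 'section-text-example/0.1'},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return html_to_text(data['parse']['text']['*'])
```

Usage would be print(console_safe(get_section_text(23862, 2))). Note that this is only a rough equivalent of explaintext: the section heading and footnote markers such as [1] stay in the output, and adjacent block elements with no whitespace between them get joined together.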

aleskva