
I'm trying to get all the past revisions (edits) of a certain Wikipedia article using the MediaWiki API. This code should retrieve all the edits made to the FDR Wikipedia page. Here is the code that I wrote to do this:

import re
import requests

def GetRevisions():
    url = "https://en.wikipedia.org/w/api.php?action=query=Franklin%20Delano%20Roosevelt=revisions&rvlimit=500&titles=" 

    while True:
        joan = requests.get(url)
        revisions = []                                        
        revisions += re.findall('<continue rvcontinue="([^"]+)"',joan)

        cont = re.search('<continue rvcontinue="([^"]+)"',joan)
        if not cont:
            break
    return revisions

The problem that I keep running into is this error: TypeError: expected string or buffer. I'm not sure why this error keeps showing up. Can anyone please give guidance on how to fix it?

Jan
dabberson567

1 Answer

re.findall('<continue rvcontinue="([^"]+)"',joan)

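joan (who is Joan??) is a requests Response object, not a string. You can't apply regular expressions to it directly, which is what triggers the TypeError. The body of the response as a string is available as joan.text.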
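To see the difference without touching the network, here is a minimal sketch. FakeResponse is a hypothetical stand-in for the real Response object (only its .text attribute is modelled), and the rvcontinue value in the sample XML is made up for illustration.

```python
import re


class FakeResponse:
    """Hypothetical stand-in for a requests Response; only .text is modelled."""

    def __init__(self, text):
        self.text = text


# Made-up sample body; a real API reply would contain far more than this.
joan = FakeResponse('<api><continue rvcontinue="20030101|123" /></api>')
pattern = '<continue rvcontinue="([^"]+)"'

try:
    re.findall(pattern, joan)  # passing the object itself, as in the question
except TypeError as exc:
    print("TypeError:", exc)

# Passing the string body instead lets the regex run.
tokens = re.findall(pattern, joan.text)
print(tokens)
```

The first call fails because re only accepts strings (or bytes), and the second succeeds because joan.text is a plain string.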

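Additionally, the MediaWiki API URL you're using is malformed: the page title and "revisions" are glued onto action=query with bare "=" signs instead of being passed as titles=... and prop=revisions, and the titles= parameter at the end is empty. It returns an error, not the data you're looking for.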
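For comparison, a well-formed query string could be built as in the sketch below. It uses the standard library's urlencode purely to show what the URL should look like; the parameter names are the ones the question was evidently aiming for.

```python
from urllib.parse import urlencode

# Each API parameter gets its own key; urlencode joins them with "&"
# and escapes the spaces in the title.
params = {
    "action": "query",
    "titles": "Franklin Delano Roosevelt",
    "prop": "revisions",
    "rvlimit": 500,
}

full_url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(full_url)
```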

You can avoid the problem entirely by requesting a JSON response from the MediaWiki API (format=json) and parsing it using .json(), as seen below. Note that I'm using a dictionary to pass parameters to the API. This means we don't have to escape query strings, and it makes it easier to update the query with continue parameters:

import requests

url = "https://en.wikipedia.org/w/api.php"
query = {
    "format": "json",
    "action": "query",
    "titles": "Franklin Delano Roosevelt",
    "prop": "revisions",
    "rvlimit": 500,
}

while True:
    # requests URL-encodes the params dictionary for us
    r = requests.get(url, params=query).json()
    print(repr(r))  # Insert your own code to parse the response here
    if 'continue' in r:
        # Carry the continuation tokens into the next request
        query.update(r['continue'])
    else:
        break
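The loop above leaves the parsing step to you. One possible way to fill it in is sketched here. The function names (fetch_json, extract_revisions, get_all_revisions) are my own, not part of requests or MediaWiki, and the sketch assumes the default JSON layout, in which query.pages is a dictionary keyed by page ID. The fetch parameter exists only so the loop can be tried with canned responses.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Perform one API request and return the parsed JSON body."""
    import requests  # imported lazily so the helpers below also work offline

    return requests.get(API_URL, params=params).json()


def extract_revisions(response):
    """Return the revision dicts contained in one parsed API response."""
    revisions = []
    # Assumption: default JSON format, where "pages" is a dict keyed by page ID.
    for page in response.get("query", {}).get("pages", {}).values():
        revisions.extend(page.get("revisions", []))
    return revisions


def get_all_revisions(title, fetch=fetch_json):
    """Follow 'continue' tokens until no more batches are returned."""
    query = {
        "format": "json",
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvlimit": 500,
    }
    revisions = []
    while True:
        response = fetch(dict(query))  # pass a copy so callers can't mutate it
        revisions.extend(extract_revisions(response))
        if "continue" not in response:
            return revisions
        # Merge the continuation tokens into the next request.
        query.update(response["continue"])
```

Calling get_all_revisions("Franklin Delano Roosevelt") should then return one list covering every batch. Which fields each entry carries depends on the rvprop parameter, which this sketch leaves at the API's default.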