Download articles from Wikipedia using Special:Export

Question

I want to download the full histories of a few thousand articles from http://en.wikipedia.org/wiki/Special:Export, and I am looking for a programmatic way to automate this. I want to save the results as XML.

Here is my attempt at the query. I started with the following Python script, but it doesn't return anything useful.

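#!/usr/bin/python
# Python 2 script: posts the Special:Export form and saves the response.

import urllib
import codecs

# Replace urllib's default user agent with a browser-like string.
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
urllib._urlopener = AppURLopener()

query = "http://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
data = { 'catname':'English-language_Indian_films','addcat':'', 'wpDownload':1 }
data = urllib.urlencode(data)

# Keep the response under its own name so it does not shadow the output file.
response = urllib.urlopen(query, data)
s = response.read()
print (s)

# Write the response to the output file.
out = codecs.open('workfile.xml', 'w', "utf-8")
out.write(s.decode('utf-8'))
out.close()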

Please don't use an incorrect user agent unless completely necessary. Wikipedia should work with any non-empty user agent. — svick, Oct 31 '11 at 19:54
@svick: That's not _completely_ true -- some user agent strings are blacklisted. Annoyingly, that includes e.g. the default libwww-perl user agent string; I wouldn't be surprised to find the default UA string for Python urllib also on the list. — Ilmari Karonen, Oct 31 '11 at 21:44
@IlmariKaronen, yeah, you're right. But any user agent that you provide yourself to identify your app should be fine. — svick, Oct 31 '11 at 21:45

score 0 · Answer 1 · answered Mar 06 '12 at 06:40

I would honestly suggest using Mechanize to fetch the page, then lxml or another XML parser to extract the information you want. I usually send a Firefox user agent, because the default user agents of many programs are blocked. Note that with Mechanize you can actually fill out the form and "click" Enter, then "click" Export.
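
If you would rather not add Mechanize as a dependency, the same form submission can probably be done with the standard library alone. The following is only a sketch in Python 3 (not the Python 2 of the question's script), and it is not the Mechanize approach itself. It assumes the `pages`, `history`, `curonly` and `wpDownload` form fields behave as described in MediaWiki's manual for Special:Export parameters. The user agent string and the function names `build_export_request`, `export_to_file` and `count_revisions` are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EXPORT_URL = "https://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
# Hypothetical identifier: replace with your own tool name and contact details.
USER_AGENT = "ExampleHistoryExporter/0.1 (contact: you@example.org)"


def build_export_request(titles, full_history=True):
    """Build a POST request asking Special:Export for the given page titles."""
    fields = {
        "pages": "\n".join(titles),  # one title per line, like the form's text area
        "wpDownload": "1",           # ask for the XML as a file download
    }
    # Assumption: "history" requests all revisions, "curonly" only the latest one.
    fields["history" if full_history else "curonly"] = "1"
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        EXPORT_URL, data=data, headers={"User-Agent": USER_AGENT}
    )


def export_to_file(titles, path, full_history=True):
    """Send the request and save the raw response bytes (needs network access)."""
    request = build_export_request(titles, full_history)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


def count_revisions(xml_bytes):
    """Return {page title: number of <revision> elements} for an export dump."""
    def local(tag):
        # Drop the "{namespace}" prefix so the export schema version does not matter.
        return tag.rsplit("}", 1)[-1]

    counts = {}
    for page in ET.fromstring(xml_bytes):
        if local(page.tag) != "page":
            continue
        title = next(c.text for c in page if local(c.tag) == "title")
        counts[title] = sum(1 for c in page if local(c.tag) == "revision")
    return counts
```

For a few thousand articles you would call something like `export_to_file(batch, "batch_001.xml")` on small batches of titles, with a pause between requests. It is worth running `count_revisions` on the saved files as a sanity check, since the wiki may cap how many revisions it returns per request.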
