I want to download the full revision histories of a few thousand articles from http://en.wikipedia.org/wiki/Special:Export, and I am looking for a programmatic way to automate this. I want to save the results as XML.

Here is my Wikipedia query. I started with the following Python script, but it doesn't return anything useful.

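#!/usr/bin/python
# Python 2 script: posts the Special:Export form and saves the response.

import urllib
import codecs

# Output file, kept under its own name so the response handle does not replace it.
out = codecs.open('workfile.xml', 'w', 'utf-8')

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
urllib._urlopener = AppURLopener()

query = "http://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
data = {'catname': 'English-language_Indian_films', 'addcat': '', 'wpDownload': 1}
data = urllib.urlencode(data)
response = urllib.urlopen(query, data)
s = response.read()
# Decode the raw bytes before writing through the UTF-8 codec writer.
out.write(s.decode('utf-8'))
out.close()
print(s)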
Cairnarvon
no_freedom
  • Why isn't the result useful? What did you expect to get? – ekhumoro Oct 31 '11 at 19:48
  • Please don't use an incorrect user agent unless it is completely necessary. Wikipedia should work with any non-empty user agent. – svick Oct 31 '11 at 19:54
  • @svick: That's not _completely_ true -- some user agent strings are blacklisted. Annoyingly, that includes e.g. the default libwww-perl user agent string; I wouldn't be surprised to find the default UA string for Python urllib also on the list. – Ilmari Karonen Oct 31 '11 at 21:44
  • @IlmariKaronen, yeah, you're right. But any user agent that you provide yourself to identify your app should be fine. – svick Oct 31 '11 at 21:45
  • @ekhumoro I want to download an XML file. – no_freedom Nov 01 '11 at 05:22

1 Answer

I would suggest using Mechanize to fetch the page, then lxml or another XML parser to extract the information you want. I usually send the Firefox user agent, because the default user agents of many programs are blocked. Note that with Mechanize you can actually fill out the form, "press" Enter, and then "click" Export, just as you would in a browser.
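As a rough illustration of this fetch-then-parse approach, here is a minimal Python 3 sketch. It swaps in the standard library (urllib and xml.etree.ElementTree) for Mechanize and lxml so that it runs without extra packages. The form fields pages, history, curonly and wpDownload are the Special:Export fields as I understand them; the user-agent string, the function names and the output path are made up for illustration, and the server may cap how many revisions it returns per request.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EXPORT_URL = "https://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
# Hypothetical identifier: replace with your own tool name and contact address.
USER_AGENT = "ArticleHistoryExporter/0.1 (contact: you@example.org)"


def build_export_request(titles, full_history=True):
    """Build a POST request asking Special:Export for the given page titles."""
    fields = {
        "pages": "\n".join(titles),  # one title per line, like the form's text box
        "wpDownload": "1",           # ask for the XML as a file download
    }
    if full_history:
        fields["history"] = "1"      # all revisions (the server may cap the count)
    else:
        fields["curonly"] = "1"      # latest revision only
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        EXPORT_URL, data=body, headers={"User-Agent": USER_AGENT}
    )


def export_to_file(titles, path):
    """Send the request and save the raw XML response to path."""
    request = build_export_request(titles)
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(xml_bytes):
    """Return {page title: number of <revision> elements} for an export dump."""
    counts = {}
    for page in ET.fromstring(xml_bytes):
        if _local_name(page.tag) != "page":
            continue
        title = next((c.text for c in page if _local_name(c.tag) == "title"), None)
        counts[title] = sum(1 for c in page if _local_name(c.tag) == "revision")
    return counts
```

Calling export_to_file(["Sholay", "Lagaan"], "workfile.xml") (the titles are only examples) would save one XML dump covering both pages, and count_revisions can then be used to check how much history actually came back. For a few thousand articles it is probably kinder to send the titles in small batches with a pause between requests.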

Snakes and Coffee