
I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:

http://en.wikipedia.org/wiki/OpenCola_(drink)

This is the shell session:

>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
  File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "c:\Python26\Lib\urllib2.py", line 397, in open
    response = meth(req, response)
  File "c:\Python26\Lib\urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "c:\Python26\Lib\urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

This happened to me on two different systems in different continents. Does anyone have an idea why this happens?

Ram Rachum

6 Answers


Wikipedia's stance is:

Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.

That is why Python's default user agent is blocked. You're supposed to download data dumps instead.

Anyway, you can still read pages like this in Python 2:

import urllib2

url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
req = urllib2.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib2.urlopen(req)
print con.read()

Or in Python 3:

import urllib.request

url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib.request.urlopen(req)
print(con.read())
Jochen Ritzel
  • "That is why Python is blocked. " I don't get what is this sentence means? However, even I made a list of 'User-Agent' and randomly choose one of them to construct a url, the website will sent me "urllib2.URLError: " or just blocked my ip from visiting their website. Can you give me more ideas? Many thanks. – MaiTiano Mar 06 '12 at 01:52
  • It's totally ridiculous that they also block `HEAD` request which are useful e.g. to validate all links posted by a user. – ThiefMaster Mar 28 '12 at 16:09
  • This approach also works for me for a HTTPS page that is returning me a 403. Why does it work, whereas `urllib2.urlopen()` results in a 403? – Pyderman Feb 02 '16 at 00:08
  • Also, if you are getting error 403 when working with an api you should use the solution described above – Luis Cabrera Benito Jul 16 '17 at 16:13

To debug this, you'll need to trap that exception.

import urllib2

try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.fp.read()

When I print the resulting message, it includes the following

"English

Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. "

S.Lott

Oftentimes websites will filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser. The following link takes you to an article that shows you how; a short sketch follows below.

http://wolfprojects.altervista.org/changeua.php
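For illustration, here is a minimal Python 2 sketch of the same idea using urllib2's opener machinery; the User-Agent string is an arbitrary placeholder, not something Wikipedia specifically requires:

import urllib2

# Build an opener that replaces urllib2's default "Python-urllib/2.x"
# User-Agent with a browser-like string, then install it globally.
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)

# Subsequent urlopen() calls now send the spoofed User-Agent.
html = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)').read()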

Eli

As Jochen Ritzel mentioned, Wikipedia blocks bots.

However, bots will not get blocked if they use the MediaWiki API (api.php). To get the Wikipedia page titled "love":

http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
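For example, here is a minimal Python 2 sketch that fetches this URL and pulls the wikitext out of the JSON reply; the exact response layout (in particular the '*' key holding the revision content) is assumed from the older API format:

import json
import urllib2

# Fetch the page "love" through the MediaWiki API and parse the JSON reply.
url = ('http://en.wikipedia.org/w/api.php'
       '?format=json&action=query&titles=love&prop=revisions&rvprop=content')
data = json.loads(urllib2.urlopen(url).read())

# The 'pages' dict is keyed by page id; the revision text is assumed to sit
# under the '*' key, as in the older API response format.
page = data['query']['pages'].values()[0]
print page['revisions'][0]['*']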

Hello World

Some websites will block access from scripts, to avoid 'unnecessary' usage of their servers, by inspecting the headers urllib sends. I don't know and can't imagine why Wikipedia does/would do this, but have you tried spoofing your headers?

Chris Foster

I made a workaround for this using PHP, which is not blocked by the site I needed.

It can be accessed like this:

import urllib2

path = ('http://phillippowers.com/redirects/get.php'
        '?file=http://website_you_need_to_load.com')
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()

This will return the HTML code to you.

Phil