I am facing a strange issue: whenever I try to download an image using urllib.urlretrieve, the call never returns. The terminal just stays busy waiting for a response that never comes.

Code

import urllib2
resp = urllib2.urlopen("http://charlesngo.com/wp-content/uploads/2015/11/rat-race-full-res-1030x728.jpg")
codeomnitrix
  • Are you using `urllib2.urlopen` or `urllib.urlretrieve`? Your question says one, your example the other. Please edit your question to refer to the right function consistently. – snakecharmerb Apr 01 '16 at 17:35
  • Oh, that's a typo. However, I tried both alternatives, and neither works. – codeomnitrix Apr 02 '16 at 09:55

1 Answer

The server is rejecting your request: it inspects the user agent header and detects that the image is being fetched by a Python script. You can set a different user agent header to override the default and mimic a request from a browser.

>>> import urllib2
>>> url = "http://charlesngo.com/wp-content/uploads/2015/11/rat-race-full-res-1030x728.jpg"
>>> req = urllib2.Request(url)
>>> resp = urllib2.urlopen(req)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> req = urllib2.Request(url)
>>> req.add_header('user-agent', "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11")
>>> resp = urllib2.urlopen(req)
>>> resp.read()[:10]
'\xff\xd8\xff\xe0\x00\x10JFIF'

See this question for more on setting the user agent header.
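
For readers on Python 3, where urllib2 was folded into urllib.request, the same approach might look roughly like the sketch below. The helper names build_request and fetch are hypothetical, the user agent string is simply the one from the session above, and the 10-second timeout is an arbitrary assumption added so that a stalled connection raises an error instead of waiting indefinitely.

```python
import urllib.request

# Browser-like user agent copied from the session above (any browser string would do).
BROWSER_UA = "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request whose User-Agent header replaces urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Download url and return the response body as bytes.

    A rejected request (for example 403 Forbidden) raises
    urllib.error.HTTPError. The timeout (an assumed value) bounds how long
    each blocking network operation may wait.
    """
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Calling fetch(url) with the image URL from the question would then return the raw JPEG bytes, provided the server accepts the header.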

It's worth noting that the server admin is trying to block scripted downloads for a reason (bandwidth costs, for example), so you should consider whether circumventing their blocking mechanism is acceptable, especially if you're going to run the download often.
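
If the download does have to run repeatedly, one way to reduce the load on the server is to keep a local copy and skip the request when the file is already present. This is only a sketch under that assumption, written for Python 3; the function name download_once and its parameters are hypothetical, and it does not check whether the remote file has changed.

```python
import os
import urllib.request


def download_once(url, dest_path, user_agent, timeout=10):
    """Download url to dest_path unless a local copy already exists.

    Returns True if a request was made, False if the existing file was reused.
    """
    if os.path.exists(dest_path):
        # A copy is already on disk, so no request is sent to the server.
        return False
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = resp.read()
    with open(dest_path, "wb") as fh:
        fh.write(data)
    return True
```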

snakecharmerb