Python urllib2.urlopen: Read site-body even though there is an HTTP header error

Question

I have this simple URL which I want to call from my Python script: http://test.my-site.com/bla-blah/createAccount (I changed some letters for privacy; all special characters etc. are exactly the same)

import urllib2

def myfunc(self, url):
    # The next line raises urllib2.HTTPError: HTTP Error 400: Bad Request
    result = urllib2.urlopen(url).read()

When I call the above URL, I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request

I do not think it has anything to do with quotes (or whitespace, obviously). When I call the URL http://test.my-site.com/bla-blah/listAccounts instead, it works fine, and result contains exactly the same text I get when I open the URL in my browser. Of course I also checked the first URL in the browser, and it works fine there.

Any idea what this might be?

Edit for clarification:

These two URLs should be callable without any further parameters or query strings, exactly as they are written above. The site should then show something like "error: parameters missing". This is what happens when I call the URLs in my browser or via curl in bash. Only the Python module is causing problems.

Edit2 (Also changed the post title to better match the situation)

Thanks, you were right: If I do curl -v 'http://test.my-site.com/bla-blah/createAccount', I get the following:

* About to connect() to <blackened> port 80 (#0)
*   Trying 193.46.215.110... connected
> GET <blackened> HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: <blackened>
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< content-language: en-US
< server: <blackened>
< date: Thu, 04 Dec 2014 07:20:15 GMT
< set-cookie: beng_proxy_session=e2e037e7e79c1b03; HttpOnly; Path=/; Version=1; Discard
< p3p: CP="CAO PSA OUR"
< content-length: 234
< 
error: parameter x missing
error: parameter y missing
* Connection #0 to host <blackened> left intact
* Closing connection #0

As you can see, the server responds with HTTP status 400 (Bad Request). curl (and the browser) still print the response body ("parameter missing..."), but Python's urllib2 raises an exception as soon as it sees the error status, so the body is never printed. (The 400 status is sent by the server application, I guess, so it has nothing to do with urllib2 itself.)

So we are one step closer, but I still need to see the body even if there is an error, because I have to know (and show) what exactly went wrong. I have just found a solution to that:

try:
    response = urllib2.urlopen("http://test.my-site.com/bla-blah/createAccount")
    contents = response.read()
    print("success: %s" % contents)
except urllib2.HTTPError as e:
    contents = e.read()
    print("error: %s" % contents)

This way I get the response body regardless of whether the request succeeds or fails.
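
The same pattern can be wrapped in a small helper. This is only a sketch: the name fetch_body and the timeout default are made up for illustration, and the import fallback is there so that it also runs on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
try:  # Python 2
    from urllib2 import urlopen, HTTPError
except ImportError:  # Python 3
    from urllib.request import urlopen
    from urllib.error import HTTPError


def fetch_body(url, timeout=10):
    """Return (status_code, body), whether or not the server reports an error.

    fetch_body is a hypothetical helper name, not part of urllib2.
    """
    try:
        response = urlopen(url, timeout=timeout)
        try:
            return response.getcode(), response.read()
        finally:
            response.close()
    except HTTPError as e:
        # HTTPError doubles as a file-like response object, so the body
        # sent together with the error status can still be read from it.
        return e.code, e.read()
```

Calling fetch_body("http://test.my-site.com/bla-blah/createAccount") should then hand back the status code together with the body text, so the caller can decide how to display the server's error message. Connection-level failures (URLError) are deliberately not caught here.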

(Btw, this is the post I got the solution from: Overriding urllib2.HTTPError or urllib.error.HTTPError and reading response HTML anyway)

Thank you very much!

That does not matter, same happens when I call this stuff directly in python — Droids, Dec 03 '14 at 12:04

mhawke · Accepted Answer · 2014-12-04T04:06:23.647

Edit 2

Python raises an exception when it receives an HTTP response with status code 400. The body of the response may well contain some text, but you never see it because the exception is raised and the data is not read. That text might be "error: parameters missing".

curl probably receives the same response but, instead of having a fit, it displays the body, so you see "error: parameters missing". Your browser behaves similarly.

Try running curl -v http://test.my-site.com/bla-blah/createAccount. This runs curl in verbose mode and you will be able to see the response and check whether status code 400 is returned. If it is status code 400 then there is nothing wrong with urllib2.urlopen(), and you just need to send the parameters in the query string.
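
If the server does turn out to expect its parameters in the query string, the URL could be built roughly like this. It is a sketch under assumptions: the parameter names x and y are borrowed from the curl output in the question, and the values "1" and "2" are placeholders.

```python
try:  # Python 2
    from urllib import urlencode
except ImportError:  # Python 3
    from urllib.parse import urlencode

base_url = "http://test.my-site.com/bla-blah/createAccount"

# Placeholder values; a list of pairs keeps the parameter order fixed.
params = [("x", "1"), ("y", "2")]

# urlencode() escapes the names and values and joins them with "&".
full_url = base_url + "?" + urlencode(params)
print(full_url)
```

The resulting full_url can be passed to urllib2.urlopen() unchanged; because no data argument is given, the request stays a GET.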

Edit 1

The following is the difference between a curl request and a urllib2.urlopen request...

[mhawke@localhost ~]$ python
GET /bla-blah/createAccount HTTP/1.1
Accept-Encoding: identity
Host: localhost:12345
Connection: close
User-Agent: Python-urllib/2.7

[mhawke@localhost ~]$ nc -l localhost 12345
GET /bla-blah/createAccount HTTP/1.1
User-Agent: curl/7.32.0
Host: localhost:12345
Accept: */*

Perhaps you can try to add/remove headers in Python to achieve the same request that curl generates.
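
One way to do that is to pass a headers dictionary to a Request object. The following is a sketch, not a guaranteed fix: the header values are copied from the curl capture above, the URL points at the local nc listener, and urllib2 may still add headers of its own (such as Connection: close) when the request is actually sent.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

# Local test listener from the capture above, e.g. nc -l localhost 12345
url = "http://localhost:12345/bla-blah/createAccount"

# Headers copied from the curl request
curl_like_headers = {
    "User-Agent": "curl/7.32.0",
    "Accept": "*/*",
}

request = Request(url, headers=curl_like_headers)

# Request stores header names capitalized (for example "User-agent");
# HTTP header names are case-insensitive, so this does not matter on the wire.
for name, value in sorted(request.header_items()):
    print("%s: %s" % (name, value))
```

Passing request to urlopen() in place of the plain URL string sends these headers, which makes it possible to check whether the server reacts differently to them.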

Original answer

The URL http://test.my-site.com/bla-blah/listAccounts looks like it would be an HTTP GET request, while http://test.my-site.com/bla-blah/createAccount probably requires an HTTP POST request that includes the data fields required to "create an account".

I don't know what data is required by your server application, but (if my guess is correct) this is generally what you need to consider doing:

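import urllib2
from urllib import urlencode

url = 'http://test.my-site.com/bla-blah/createAccount'
# Example field names; use whatever the server application actually expects
data = {'username': 'droids', 'password': '123droids321', 'phone': '012351234'}
# Passing a data argument makes urlopen() issue a POST instead of a GET
result = urllib2.urlopen(url, urlencode(data)).read()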

The presence of the urlencoded data generates a POST request, instead of the GET request that your current code would be issuing.
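
The GET/POST switch can be checked without touching the network by asking a Request object which method it would use. A minimal sketch: the field name and value are placeholders, and the .encode("ascii") call is only there because Python 3 requires the body as bytes.

```python
try:  # Python 2
    from urllib2 import Request
    from urllib import urlencode
except ImportError:  # Python 3
    from urllib.request import Request
    from urllib.parse import urlencode

url = "http://test.my-site.com/bla-blah/createAccount"
fields = {"username": "droids"}  # placeholder form data

# No data argument: the request would be sent as GET.
get_request = Request(url)

# With a data argument: the request would be sent as POST.
post_request = Request(url, urlencode(fields).encode("ascii"))

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```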

Note that there is a far more usable module for HTTP: requests. Check it out.

No, these requests are all called with GET, and can be called without any parameter. You then get to see some basic HTML output like "missing parameter". Also I just checked it with curl in bash, and it definitely works, only the python module has a problem with that... Sadly I cannot use requests, as I would have to manually pip-install it on every one of our servers, which is simply not maintainable... — Droids, Dec 03 '14 at 12:11
Well, another possibility then is that `createAccount` requires a query string in the URL which passes the required parameters? You don't show that in your question and the obfuscation of URLs is not helping. — mhawke, Dec 03 '14 at 12:14
No, as I said I can call `http://test.my-site.com/bla-blah/createAccount` without any further parameters or query strings in the browser or via curl and it works — Droids, Dec 03 '14 at 12:24
So, your python code is: `urllib2.urlopen(url)` and your curl test command is: `curl url`? There are minor differences in the generated requests (I've added these to my answer), but none that _should_ trigger a HTTP 400 response. Are there any proxies that might be changing the request? — mhawke, Dec 03 '14 at 12:34
Yes, these are my two commands (except I do `.read()` after urlopen). I don't think there are proxies... And sadly I cannot look into server logs, because a colleague is sick today. But I will try to get there some other way — Droids, Dec 03 '14 at 12:45
@Droids - updated answer. Please run curl in verbose mode to check the actual HTTP status code in the response. — mhawke, Dec 04 '14 at 04:07
Last comment gave the hint that I needed. See opening post. Thank you! — Droids, Dec 04 '14 at 08:48
