12

I'm perplexed as to why I'm not able to download the entire contents of some JSON responses from FriendFeed using urllib2.

>>> import urllib2
>>> stream = urllib2.urlopen('http://friendfeed.com/api/room/the-life-scientists/profile?format=json')
>>> stream.headers['content-length']
'168928'
>>> data = stream.read()
>>> len(data)
61058
>>> # We can see here that I did not retrieve the full JSON
... # given that the stream doesn't end with a closing }
... 
>>> data[-40:]
'ce2-003048343a40","name":"Vincent Racani'

How can I retrieve the full response with urllib2?

gotgenes
  • Site's broken. Try in a browser. – Jed Smith Dec 01 '09 at 05:47
  • I get the full 165K of the response when hitting that URL with Firefox 3.0 on Ubuntu 9.04. The JSON document retrieved is well formed in my browser. – gotgenes Dec 01 '09 at 14:23
  • Yes, the site is broken. But this is certainly a bug in both `urllib` and `urllib2`, since other tools (curl, wget) report the incomplete response. It would be nice to know what is wrong in the Python libraries. – Denis Otkidach Dec 01 '09 at 14:46
  • Ah, well, I just got an incomplete retrieval for a different room profile, http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json, when retrieving it with my browser or with curl, so the response from the server does seem broken. I've sent an email to the API developer. Sorry for the wild goose chase. :-( I'll report back when he nabs the bug. – gotgenes Dec 01 '09 at 15:45
  • I had the same problem. Bizarrely, urllib.urlretrieve() retrieves the entire thing (and puts it in a file), so maybe there's some code in it to use – dfrankow Nov 19 '10 at 23:37
  • Note: my problem was a space (not %20, an actual space) in the URL. Apparently urllib.urlretrieve() is robust to spaces, but urllib2.urlopen() is not. – dfrankow Nov 24 '10 at 04:03
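As the last two comments note, `urllib.urlretrieve()` tolerates a raw space in a URL while `urllib2.urlopen()` does not; percent-encoding the path before building the URL sidesteps that. A minimal sketch (the try/except import is only there so the snippet runs on both Python 2 and 3; the path itself is made up for illustration):

```python
# Percent-encode a path containing a space before handing the URL to urlopen
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

path = '/uploaded_images/head_minnie big.jpg'   # hypothetical path with a space
safe_path = quote(path)   # '/' is left alone by default; ' ' becomes %20
print(safe_path)          # -> /uploaded_images/head_minnie%20big.jpg
```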

4 Answers

18

Best way to get all of the data:

import urllib2

fp = urllib2.urlopen("http://www.example.com/index.cfm")

response = ""
while True:
    data = fp.read()
    if not data:  # read() returns an empty string at EOF, so this test is sufficient
        break
    response += data

print response

The reason is that, given the nature of sockets, a single call to `.read()` isn't guaranteed to return the entire response. I thought this was discussed in the documentation (maybe `urllib`'s), but I cannot find it.
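The same loop can also be written with an explicit chunk size, which is the usual pattern when responses are large. A self-contained sketch, with `io.BytesIO` standing in for the object `urlopen` returns and `8192` as an arbitrary buffer size:

```python
import io

def read_all(fp, chunk_size=8192):
    # Accumulate fixed-size chunks until read() signals EOF with an empty string
    chunks = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)

# Stand-in for urllib2.urlopen(...); any file-like object works
stream = io.BytesIO(b'x' * 20000)
print(len(read_all(stream)))  # -> 20000
```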

Jed Smith
  • I couldn't get this example to work with the example URL given in the question, http://friendfeed.com/api/room/the-life-scientists/profile?format=json. The response is still incomplete. As I mentioned to John Weldon, repeated calls to `read()` only return empty strings, and `read()` seems exhaustive. – gotgenes Dec 01 '09 at 05:39
  • I only get 51.21 KB (52441 bytes) in my browser. The site is broken. – Jed Smith Dec 01 '09 at 05:46
  • Also doesn't work for http://www.nylonmag.com/modules/magsection/article/uploaded_images/5463_head_minnie%20big.jpg, although wget returns the full page, and Firefox can display the jpg. – dfrankow Nov 19 '10 at 22:30
  • Also, although this solution didn't work for me, that's what urlretrieve does (http://www.google.com/codesearch/p?hl=en#sRsuLDQ3rCI/trunk/sandbox/wierzbicki/test27/Lib/urllib.py&q=urllib%20%22def%20urlretrieve%22&l=247), and urlretrieve works for me! – dfrankow Nov 19 '10 at 23:45
  • The limitations of `read()` are discussed in the docs for `urllib` (http://docs.python.org/2/library/urllib.html#urllib.urlopen). The sentence "One caveat..." should be bold. – approxiblue Sep 18 '13 at 17:29
  • You can also do something like `response = b''.join(iter(fp.read, b''))` – Artyer Nov 09 '17 at 20:24
4

Use tcpdump (or something like it) to monitor the actual network interactions; then you can analyze why the site is broken for some client libraries. Repeat the test multiple times by scripting it, so you can see whether the problem is consistent:

import urllib2
url = 'http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json'
stream = urllib2.urlopen(url)
expected = int(stream.headers['content-length'])
data = stream.read()
datalen = len(data)
print expected, datalen, expected == datalen

The site's working consistently for me so I can't give examples of finding failures :)
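The repeat-and-compare idea above can be wrapped in a small helper. In this sketch the opener is passed in as a parameter so the logic can be exercised without the network; `check_lengths`, `FakeStream`, and `attempts=3` are all made up for illustration — with the real library you would pass `urllib2.urlopen` as `open_url`:

```python
import io

def check_lengths(open_url, url, attempts=3):
    # Compare the advertised Content-Length against the bytes actually read,
    # several times in a row, and collect each result
    results = []
    for _ in range(attempts):
        stream = open_url(url)
        expected = int(stream.headers['content-length'])
        got = len(stream.read())
        results.append((expected, got, expected == got))
    return results

class FakeStream(object):
    # Minimal stand-in for the object urllib2.urlopen returns
    def __init__(self, body):
        self.headers = {'content-length': str(len(body))}
        self._fp = io.BytesIO(body)
    def read(self):
        return self._fp.read()

print(check_lengths(lambda url: FakeStream(b'a' * 100), 'http://example.com/'))
```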

David Fraser
2

Keep calling stream.read() until it's done...

data = stream.read()
while data:
    # ... do stuff with data ...
    data = stream.read()
John Weldon
0
`readlines()` also works.

inspectorG4dget
    It doesn't for me. `data = ''.join(stream.readlines()); print len(data); print(data[-40:])` gives identical results. – gotgenes Dec 01 '09 at 05:17
  • stream.readlines() returns a list of all the lines. But I just also realized that you are using the urllib2 module. My answer was based on the urllib module, which I have been using for longer; I just double-checked stream.readlines() from the urllib module and it works properly. – inspectorG4dget Dec 02 '09 at 00:41