12

I'm perplexed as to why I'm not able to download the entire contents of some JSON responses from FriendFeed using urllib2.

>>> import urllib2
>>> stream = urllib2.urlopen('http://friendfeed.com/api/room/the-life-scientists/profile?format=json')
>>> stream.headers['content-length']
'168928'
>>> data = stream.read()
>>> len(data)
61058
>>> # We can see here that I did not retrieve the full JSON
... # given that the stream doesn't end with a closing }
... 
>>> data[-40:]
'ce2-003048343a40","name":"Vincent Racani'

How can I retrieve the full response with urllib2?

gotgenes
  • Site's broken. Try in a browser. – Jed Smith Dec 01 '09 at 05:47
  • I get the full 165K of the response when hitting that URL with Firefox 3.0 on Ubuntu 9.04. The JSON document retrieved is well formed in my browser. – gotgenes Dec 01 '09 at 14:23
  • Yes, the site is broken. But this is certainly a bug in both `urllib` and `urllib2`, since other tools (curl, wget) report the incomplete response. It would be nice to know what is wrong in the Python libraries. – Denis Otkidach Dec 01 '09 at 14:46
  • Ah, well, I just got an incomplete retrieval for a different room profile, http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json, when retrieving it with my browser or with curl, so the response from the server does seem broken. I've sent an email to the API developer. Sorry for the wild goose chase. :-( I'll report back when he nabs the bug. – gotgenes Dec 01 '09 at 15:45
  • I had the same problem. Bizarrely, urllib.urlretrieve() retrieves the entire thing (and puts it in a file), so maybe there's some code in it to use – dfrankow Nov 19 '10 at 23:37
  • Note: my problem was a space (not %20, an actual space) in the URL. Apparently urllib.urlretrieve() is robust to spaces, but urllib2.urlopen() is not. – dfrankow Nov 24 '10 at 04:03
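As the last two comments note, `urllib.urlretrieve()` tolerates a raw space in a URL while `urllib2.urlopen()` does not; percent-encoding the path before building the URL sidesteps that. A minimal sketch (the try/except import is only there so the snippet runs on both Python 2 and 3; the path itself is made up for illustration):

```python
# Percent-encode a path containing a space before handing the URL to urlopen
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

path = '/uploaded_images/head_minnie big.jpg'   # hypothetical path with a space
safe_path = quote(path)   # '/' is left alone by default; ' ' becomes %20
print(safe_path)          # -> /uploaded_images/head_minnie%20big.jpg
```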

4 Answers

18

Best way to get all of the data:

import urllib2

fp = urllib2.urlopen("http://www.example.com/index.cfm")

response = ""
while True:
    data = fp.read()
    if not data:  # read() returns an empty string at EOF, so this test is sufficient
        break
    response += data

print response

The reason is that, given the nature of sockets, a single call to `.read()` isn't guaranteed to return the entire response. I thought this was discussed in the documentation (maybe `urllib`'s), but I cannot find it.
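The same loop can also be written with an explicit chunk size, which is the usual pattern when responses are large. A self-contained sketch, with `io.BytesIO` standing in for the object `urlopen` returns and `8192` as an arbitrary buffer size:

```python
import io

def read_all(fp, chunk_size=8192):
    # Accumulate fixed-size chunks until read() signals EOF with an empty string
    chunks = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)

# Stand-in for urllib2.urlopen(...); any file-like object works
stream = io.BytesIO(b'x' * 20000)
print(len(read_all(stream)))  # -> 20000
```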

Jed Smith
  • I couldn't get this example to work with the example URL given in the question, http://friendfeed.com/api/room/the-life-scientists/profile?format=json. The response is still incomplete. As I mentioned to John Weldon, repeated calls to `read()` only return empty strings, and `read()` seems exhaustive. – gotgenes Dec 01 '09 at 05:39
  • I only get 51.21 KB (52441 bytes) in my browser. The site is broken. – Jed Smith Dec 01 '09 at 05:46
  • Also doesn't work for http://www.nylonmag.com/modules/magsection/article/uploaded_images/5463_head_minnie%20big.jpg, although wget returns the full page, and Firefox can display the jpg. – dfrankow Nov 19 '10 at 22:30
  • Also, although this solution didn't work for me, that's what urlretrieve does (http://www.google.com/codesearch/p?hl=en#sRsuLDQ3rCI/trunk/sandbox/wierzbicki/test27/Lib/urllib.py&q=urllib%20%22def%20urlretrieve%22&l=247), and urlretrieve works for me! – dfrankow Nov 19 '10 at 23:45
  • The limitations of `read()` are discussed in the docs for `urllib` (http://docs.python.org/2/library/urllib.html#urllib.urlopen). The sentence "One caveat..." should be bold. – approxiblue Sep 18 '13 at 17:29
  • You can also do something like `response = b''.join(iter(fp.read, b''))` – Artyer Nov 09 '17 at 20:24
4

Use tcpdump (or something like it) to monitor the actual network interactions; then you can analyze why the site is broken for some client libraries. Repeat the test multiple times by scripting it, so you can see whether the problem is consistent:

import urllib2
url = 'http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json'
stream = urllib2.urlopen(url)
expected = int(stream.headers['content-length'])
data = stream.read()
datalen = len(data)
print expected, datalen, expected == datalen

The site's working consistently for me so I can't give examples of finding failures :)
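The repeat-and-compare idea above can be wrapped in a small helper. In this sketch the opener is passed in as a parameter so the logic can be exercised without the network; `check_lengths`, `FakeStream`, and `attempts=3` are all made up for illustration — with the real library you would pass `urllib2.urlopen` as `open_url`:

```python
import io

def check_lengths(open_url, url, attempts=3):
    # Compare the advertised Content-Length against the bytes actually read,
    # several times in a row, and collect each result
    results = []
    for _ in range(attempts):
        stream = open_url(url)
        expected = int(stream.headers['content-length'])
        got = len(stream.read())
        results.append((expected, got, expected == got))
    return results

class FakeStream(object):
    # Minimal stand-in for the object urllib2.urlopen returns
    def __init__(self, body):
        self.headers = {'content-length': str(len(body))}
        self._fp = io.BytesIO(body)
    def read(self):
        return self._fp.read()

print(check_lengths(lambda url: FakeStream(b'a' * 100), 'http://example.com/'))
```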

David Fraser
2

Keep calling stream.read() until it's done...

data = stream.read()
while data:
    # ... do stuff with data ...
    data = stream.read()
John Weldon
0
`readlines()` also works.

inspectorG4dget
    It doesn't for me. `data = ''.join(stream.readlines()); print len(data); print(data[-40:])` gives identical results. – gotgenes Dec 01 '09 at 05:17
  • stream.readlines() returns a list of all the lines. But I just also realized that you are using the urllib2 module. My answer was based on the urllib module, which I have been using for longer; I just double-checked stream.readlines() from the urllib module and it works properly. – inspectorG4dget Dec 02 '09 at 00:41