
OK, this is driving me nuts.

I am trying to read from the Crunchbase API using Python's urllib2 library. Relevant code:

import urllib2

api_url = "http://api.crunchbase.com/v/1/financial-organization/venrock.js"
print len(urllib2.urlopen(api_url).read())

The result is either 73493 or 69397. The actual length of the document is much longer. When I try this on a different computer, the length is either 44821 or 40725. I've tried changing the user-agent, using urllib, increasing the timeout to a very large number, and reading small chunks at a time (sketched below). Always the same result.
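
For reference, the chunked read was along these lines (a minimal sketch; the 8 KB chunk size is arbitrary):

import urllib2

api_url = "http://api.crunchbase.com/v/1/financial-organization/venrock.js"
resp = urllib2.urlopen(api_url)
chunks = []
while True:
    chunk = resp.read(8192)  # read in small pieces instead of one big read()
    if not chunk:
        break
    chunks.append(chunk)
print len(''.join(chunks))  # comes back just as short as the one-shot read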

I assumed it was a server problem, but my browser reads the whole thing.

Python 2.7.2 on OS X 10.6.8 for the ~40k lengths; Python 2.7.1 running under IPython on OS X 10.7.3 for the ~70k lengths. Thoughts?

Jerry Neumann
  • I think it actually is a server problem. I tried that URL with curl, and it didn't read it all either. It ended with: curl: (18) transfer closed with 158818 bytes remaining to read – Keith Jun 05 '12 at 02:20

2 Answers


There is something kooky with that server. It might work if you, like your browser, request the file with gzip encoding. Here is some code that should do the trick:

import urllib2, gzip

api_url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
req = urllib2.Request(api_url)
req.add_header('Accept-encoding', 'gzip')  # ask for gzip, as a browser does
resp = urllib2.urlopen(req)
data = resp.read()  # the body is still gzip-compressed at this point

>>> print len(data)
26610

The problem then is to decompress the data:

from StringIO import StringIO

# decompress only if the server honoured the gzip request
if resp.info().get('Content-Encoding') == 'gzip':
    g = gzip.GzipFile(fileobj=StringIO(data))
    data = g.read()

>>> print len(data)
183159
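
Putting the two steps together: a sketch of a small helper that asks for gzip and falls back to the raw body when the server doesn't compress (the name fetch is just for illustration):

import urllib2, gzip
from StringIO import StringIO

def fetch(url):
    # request gzip, the way a browser does
    req = urllib2.Request(url)
    req.add_header('Accept-encoding', 'gzip')
    resp = urllib2.urlopen(req)
    data = resp.read()
    # decompress only if the server actually sent gzip
    if resp.info().get('Content-Encoding') == 'gzip':
        data = gzip.GzipFile(fileobj=StringIO(data)).read()
    return data
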
mhawke

I'm not sure if this is a valid answer, since it's a different module entirely, but using the requests module I get a ~183k response:

import requests

url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
r = requests.get(url)

>>> print len(r.text)
183159

So if it's not too late in the project, check it out here: http://docs.python-requests.org/en/latest/index.html
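
If you want to see what's going on, requests keeps both the headers it sent and the ones the server returned (a quick check; r.request holds the request that was sent, and the exact values will vary):

import requests

url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
r = requests.get(url)
print r.request.headers.get('Accept-Encoding')  # what requests asked for
print r.headers.get('content-encoding')         # what the server answered with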

Edit: using the code you provided, I also get a length of ~36k.

Did a quick search and found this: urllib2 not retrieving entire HTTP response

TankorSmash
  • This is nice. `requests` includes this header in the request: `Accept-Encoding: identity, deflate, compress, gzip`. So gzip requests seem to work ok for that server. – mhawke Jun 05 '12 at 02:26
  • That's a nice library. I will definitely be looking into it, especially the async requests. Thanks. – Jerry Neumann Jun 05 '12 at 11:29