
OK, this is driving me nuts.

I am trying to read from the Crunchbase API using Python's urllib2 library. Relevant code:

import urllib2

api_url = "http://api.crunchbase.com/v/1/financial-organization/venrock.js"
print len(urllib2.urlopen(api_url).read())

The result is either 73493 or 69397. The actual length of the document is much longer. When I try this on a different computer, the length is either 44821 or 40725. I've tried changing the user-agent, using urllib, increasing the timeout to a very large number, and reading small chunks at a time (sketched below). Always the same result.
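
For reference, the chunked read was along these lines (a minimal sketch; the 8 KB chunk size is arbitrary):

import urllib2

api_url = "http://api.crunchbase.com/v/1/financial-organization/venrock.js"
resp = urllib2.urlopen(api_url)
chunks = []
while True:
    chunk = resp.read(8192)  # read in small pieces instead of one big read()
    if not chunk:
        break
    chunks.append(chunk)
print len(''.join(chunks))  # comes back just as short as the one-shot read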

I assumed it was a server problem, but my browser reads the whole thing.

Python 2.7.2 on OS X 10.6.8 for the ~40k lengths; Python 2.7.1 running under IPython on OS X 10.7.3 for the ~70k lengths. Thoughts?

Jerry Neumann
  • I think it actually is a server problem. I tried that URL with curl, and it didn't read it all either. It ended with: curl: (18) transfer closed with 158818 bytes remaining to read – Keith Jun 05 '12 at 02:20

2 Answers


There is something kooky with that server. It might work if you, like your browser, request the file with gzip encoding. Here is some code that should do the trick:

import urllib2, gzip

api_url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
req = urllib2.Request(api_url)
req.add_header('Accept-encoding', 'gzip')  # ask for gzip, as a browser does
resp = urllib2.urlopen(req)
data = resp.read()  # the body is still gzip-compressed at this point

>>> print len(data)
26610

The problem then is to decompress the data:

from StringIO import StringIO

# decompress only if the server honoured the gzip request
if resp.info().get('Content-Encoding') == 'gzip':
    g = gzip.GzipFile(fileobj=StringIO(data))
    data = g.read()

>>> print len(data)
183159
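
Putting the two steps together: a sketch of a small helper that asks for gzip and falls back to the raw body when the server doesn't compress (the name fetch is just for illustration):

import urllib2, gzip
from StringIO import StringIO

def fetch(url):
    # request gzip, the way a browser does
    req = urllib2.Request(url)
    req.add_header('Accept-encoding', 'gzip')
    resp = urllib2.urlopen(req)
    data = resp.read()
    # decompress only if the server actually sent gzip
    if resp.info().get('Content-Encoding') == 'gzip':
        data = gzip.GzipFile(fileobj=StringIO(data)).read()
    return data
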
mhawke

I'm not sure if this is a valid answer, since it's a different module entirely, but using the requests module I get a ~183k response:

import requests

url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
r = requests.get(url)

>>> print len(r.text)
183159

So if it's not too late in the project, check it out here: http://docs.python-requests.org/en/latest/index.html
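
If you want to see what's going on, requests keeps both the headers it sent and the ones the server returned (a quick check; r.request holds the request that was sent, and the exact values will vary):

import requests

url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
r = requests.get(url)
print r.request.headers.get('Accept-Encoding')  # what requests asked for
print r.headers.get('content-encoding')         # what the server answered with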

Edit: using the code you provided, I also get a length of ~36k.

Did a quick search and found this: urllib2 not retrieving entire HTTP response

TankorSmash
  • This is nice. `requests` includes this header in the request: `Accept-Encoding: identity, deflate, compress, gzip`. So gzip requests seem to work ok for that server. – mhawke Jun 05 '12 at 02:26
  • That's a nice library. I will definitely be looking into it, especially the async requests. Thanks. – Jerry Neumann Jun 05 '12 at 11:29