
I'm trying to download a large file from a server with Python 2:

import urllib2

req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
data = rsp.read()

The server sends the data with "Transfer-Encoding: chunked", and I only get part of the binary data, which cannot be unpacked by gunzip.

Do I have to iterate over multiple read()s? Or over multiple requests? If so, what should they look like?

Note: I'm trying to solve the problem with only the Python 2 standard library, without additional libraries such as urllib3 or requests. Is this even possible?

3 Answers


From the Python documentation on urllib2.urlopen:

One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.

So, read the data in a loop:

import urllib2

req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
data = rsp.read(8192)
while data:
    # ... do something with this chunk of data ...
    data = rsp.read(8192)
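
If the goal is simply to get the file onto disk, the same loop can write each chunk straight to a file; a minimal sketch of that idea (the local file name is just an example, not from the question):

import urllib2

req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)

# Write each chunk to disk as it arrives so the whole file never sits in memory.
with open("mylargefile.gz", "wb") as out:  # local file name is just an example
    chunk = rsp.read(8192)
    while chunk:
        out.write(chunk)
        chunk = rsp.read(8192)
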
jaime
  • I'm under the impression that this only works for downloading files that are not sent with Transfer-Encoding: chunked. –  Jun 24 '14 at 19:01
  • Hmm, you're right. I saw a similar question with no answer: http://stackoverflow.com/questions/15115606/urllib2-python-transfer-encoding-chunked Sorry, I'm not sure how to get past it. The only answer used curl. – jaime Jun 24 '14 at 19:16
  • OK, I'll try curl, which is a little bit cumbersome with login cookies compared to Python, but better than nothing. Thanks! –  Jun 24 '14 at 19:27

If I'm not mistaken, the following worked for me - a while back:

data = ''
chunk = rsp.read()
while chunk:
    data += chunk
    chunk = rsp.read()

Each read() returns one chunk, so keep reading until nothing more comes back. I don't have documentation ready to support this... yet.
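
As a side note, repeated string concatenation can get slow for a very large download; a hedged variant of the same loop collects the chunks in a list and joins them once at the end (this still buffers the whole file in memory):

import urllib2

req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)

# Collect the chunks in a list and join once at the end; this avoids
# re-copying the growing string on every iteration.
parts = []
chunk = rsp.read()
while chunk:
    parts.append(chunk)
    chunk = rsp.read()
data = ''.join(parts)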

sebastian
  • Unfortunately, this does not work for me: content = '' while True: chunk = rsp.read() if not chunk: break content += chunk f.write(content) –  Jun 24 '14 at 18:55
  • `does not work` unfortunately is a very unhelpful statement :) Is there still content missing? – sebastian Jun 25 '14 at 06:09
  • Sorry: "does not work" in the sense of "exactly like before", i.e. the data is still not complete and cannot be read by gunzip. I assume that urllib2 just does not support chunked transfer-encoding. –  Jun 25 '14 at 11:51

I have the same problem.

I found that "Transfer-Encoding: chunked" often appears with "Content-Encoding: gzip".

So maybe we can request the compressed content and unzip it ourselves.

This works for me:

import urllib2
from StringIO import StringIO
import gzip

url = "https://myserver/mylargefile.gz"  # e.g. the URL from the question
req = urllib2.Request(url)
req.add_header('Accept-encoding', 'gzip, deflate')
rsp = urllib2.urlopen(req)
if rsp.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(rsp.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
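
The same idea, rearranged as a hedged sketch that reads the body once, unwraps it only when the header says it was compressed, and then saves it to disk; this assumes the same req/rsp setup as above, and the local file name is again just an example:

body = rsp.read()
if rsp.info().get('Content-Encoding') == 'gzip':
    # The transport-level gzip wraps the actual file; unwrap it here.
    body = gzip.GzipFile(fileobj=StringIO(body)).read()

# body should now hold the file exactly as it sits on the server,
# e.g. the original mylargefile.gz from the question.
with open("mylargefile.gz", "wb") as out:
    out.write(body)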