
I'm using Requests to download a file (several gigabytes) from a server. To provide progress updates (and to keep the entire file from having to be held in memory) I've set stream=True and write the download to a file:

import requests

completed_bytes = 0

with open('output', 'wb') as f:
    response = requests.get(url, stream=True)

    if not response.ok:
        print 'There was an error'
        exit()

    # write_progress() and total_bytes are defined elsewhere
    for block in response.iter_content(1024 * 100):
        f.write(block)
        completed_bytes += len(block)
        write_progress(completed_bytes, total_bytes)

However, at some random point in the download, Requests throws a ChunkedEncodingError. I've gone into the source and found that this corresponds to an IncompleteRead exception. I inserted a log statement around those lines and found that e.partial = "\r". I know that the server gives the downloads low priority and I suspect that this exception occurs when the server waits too long to send the next chunk.
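
For reference, the exception can at least be caught at the application level; a minimal sketch (reusing the names from the snippet above) that just reports how far the download got before failing:

try:
    for block in response.iter_content(1024 * 100):
        f.write(block)
        completed_bytes += len(block)
        write_progress(completed_bytes, total_bytes)
except requests.exceptions.ChunkedEncodingError as e:
    # e wraps the underlying urllib3/httplib error
    print 'Aborted after %d bytes: %r' % (completed_bytes, e)

Catching it this way doesn't save the download, of course; it only makes the failure observable.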

As expected, the exception stops the download. Unfortunately, the server does not implement HTTP/1.1's content ranges, so I cannot simply resume it. I've played around with increasing urllib3's internal timeout, but the exception persists.
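
For completeness, Requests also accepts a (connect, read) timeout tuple directly on the call, which is another way to give the server more time between chunks; a minimal sketch (the 600-second read timeout is just an illustration, not a recommendation):

# connect timeout of 10s, read timeout of 600s between socket reads
response = requests.get(url, stream=True, timeout=(10, 600))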

Is there any way to make the underlying urllib3 (or Requests) more tolerant of these empty (or late) chunks so that the file can download completely?

Bailey Parker
  • What platform are you on? Might I suggest the use of a tool that may be specialized for your use that you can call through the shell, such as curl? – std''OrgnlDave Sep 15 '16 at 00:14
  • can you try setting a longer timeout in the get (the kwarg timeout should be working with stream=True in 2.3, see https://github.com/kennethreitz/requests/issues/1803). I would also verify that your headers for content type and encoding match what you are expecting to ensure it's not truncating the stream – A Small Shell Script Sep 29 '16 at 02:58
  • Have you tried with a smaller block? Seems like I've always used 1024 or 2048. – Wyrmwood Jan 19 '17 at 22:39

1 Answer

import httplib  # Python 2; this module is http.client on Python 3


def patch_http_response_read(func):
    # Wrap HTTPResponse.read so that an IncompleteRead returns whatever
    # partial data was received instead of raising.
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead as e:
            return e.partial
    return inner

# Monkey-patch before requests/urllib3 opens the connection.
httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)

I can't reproduce your problem right now, but I think this patch could help. It lets you deal with defective HTTP servers.

Most bad servers transmit all of the data, but due to implementation errors they close the session incorrectly, so httplib raises the error and buries your precious bytes.
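
Note that with this patch a truncated response is no longer distinguishable from a complete one, so it is worth checking the result yourself afterwards. A minimal sketch of such a check, assuming the server sends a Content-Length header and the file is written to 'output' as in the question:

import os

# Compare what landed on disk with what the server advertised.
expected = int(response.headers.get('content-length', 0))
actual = os.path.getsize('output')
if expected and actual != expected:
    print 'Incomplete download: %d of %d bytes' % (actual, expected)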

gushitong