
I'm using requests to download some large files (100-5000 MB). I'm using a Session together with urllib3.Retry to get automatic retries. It appears such retries only apply before the HTTP headers have been received and the content has started streaming. Once the 200 response has arrived and the body is being read, a network dip is raised as a ReadTimeoutError instead of being retried.

See the following example:

import requests, logging, sys
from requests.adapters import HTTPAdapter
from urllib3 import Retry


def create_session():
    retries = Retry(total=5, backoff_factor=1)
    s = requests.Session()
    s.mount("http://", HTTPAdapter(max_retries=retries))
    s.mount("https://", HTTPAdapter(max_retries=retries))
    return s

logging.basicConfig(level=logging.DEBUG, stream=sys.stderr)

url = "https://example/example.zip"  # placeholder URL matching the log output below
session = create_session()
response = session.get(url, timeout=(120, 10))  # deliberately short read timeout

This gives the following log output:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): example:443
DEBUG:urllib3.connectionpool:https://example:443 "GET /example.zip HTTP/1.1" 200 1568141974

< UNPLUG NETWORK CABLE FOR 10-15 sec HERE > 

Traceback (most recent call last):
  File "urllib3/response.py", line 438, in _error_catcher
    yield
  File "urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3.8/http/client.py", line 458, in read
    n = self.readinto(b)
  File "/usr/lib/python3.8/http/client.py", line 502, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "requests/models.py", line 753, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "urllib3/response.py", line 443, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='example', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 14, in _download
    response = session.get(url, headers=headers, timeout=300)
  File "requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "requests/sessions.py", line 697, in send
    r.content
  File "requests/models.py", line 831, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "requests/models.py", line 760, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='example', port=443): Read timed out. 

I can sort of understand why this does not work, and it becomes even more obvious when you add stream=True and consume the body with response.iter_content(). I assume the rationale is that the read timeout and the TCP layer should handle this (in my example I set the read timeout deliberately low to provoke it). But we have cases where servers restart/crash or where firewalls drop connections in the middle of a stream, and the only option for the client is to retry the entire download.
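For what it's worth, here is a minimal sketch of the streaming variant (reusing create_session() and url from the example above; the output file name is a placeholder). The error then surfaces inside the iter_content() loop rather than inside session.get():

with create_session() as session:
    response = session.get(url, stream=True, timeout=(120, 10))
    response.raise_for_status()
    with open("example.zip", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            # A network dip here raises requests.exceptions.ConnectionError
            # (wrapping urllib3's ReadTimeoutError); Retry does not kick in.
            f.write(chunk)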

Is there any simple solution to this problem, ideally built into requests? One could always wrap the whole thing with tenacity or manual retries, but I would rather avoid that: it adds another layer, and you then have to distinguish network errors from other, real errors, and so on.
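For comparison, this is roughly the kind of wrapper I am trying to avoid. It is a sketch only: download_with_retries and its retry/backoff values are made up, and it reuses create_session() from the example above.

import time
import requests

def download_with_retries(url, dest, attempts=5, backoff=1):
    # Hypothetical wrapper: retries the *whole* download on network errors,
    # which is exactly the extra layer I'd prefer requests/urllib3 handled.
    for attempt in range(attempts):
        try:
            with create_session() as session:
                response = session.get(url, stream=True, timeout=(120, 10))
                response.raise_for_status()
                with open(dest, "wb") as f:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        f.write(chunk)
            return
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ChunkedEncodingError,
                requests.exceptions.Timeout):
            # Only retry on errors that look like network problems.
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * 2 ** attempt)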

ee555
  • If the file(s) you are downloading come from a server that supports e-tags and range requests, you can keep track of content-length and how much you have downloaded. If you get disconnected you can try to finish downloading the file when the upstream server becomes available again by using the etag to make sure the file has not changed and the range header to download the missing bytes without needing to start from scratch. – Lucas Scott May 20 '21 at 15:54
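A rough sketch of what that comment describes, assuming the server supports ETag, Range and If-Range requests. resume_download, the chunk size and the timeout values are placeholders, and create_session() is reused from the example above:

import os
import requests

def resume_download(url, dest, etag=None):
    # Hypothetical resume helper: append the missing bytes if a partial
    # file already exists and the server's copy has not changed.
    headers = {}
    if etag and os.path.exists(dest):
        headers["Range"] = f"bytes={os.path.getsize(dest)}-"
        headers["If-Range"] = etag  # only resume if the file is unchanged
    with create_session() as session:
        response = session.get(url, headers=headers, stream=True, timeout=(120, 10))
        response.raise_for_status()
        # 206 Partial Content -> append the missing bytes; 200 -> start over.
        mode = "ab" if response.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
        return response.headers.get("ETag")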

0 Answers