I'm using requests to download some large files (100-5000 MB), with a Session and urllib3.Retry to get automatic retries. It appears these retries only apply before the HTTP headers have been received and the content has started streaming. Once the 200 response is in and the body is streaming, a network dip surfaces as a ReadTimeoutError instead of being retried.
See the following example:
import sys, requests, logging
from requests.adapters import HTTPAdapter
from urllib3 import Retry

def create_session():
    retries = Retry(total=5, backoff_factor=1)
    s = requests.Session()
    s.mount("http://", HTTPAdapter(max_retries=retries))
    s.mount("https://", HTTPAdapter(max_retries=retries))
    return s

logging.basicConfig(level=logging.DEBUG, stream=sys.stderr)

url = "https://example/example.zip"  # placeholder matching the log output below
session = create_session()
response = session.get(url, timeout=(120, 10))  # deliberately short read timeout to provoke the error
This gives the following log output:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): example:443
DEBUG:urllib3.connectionpool:https://example:443 "GET /example.zip HTTP/1.1" 200 1568141974
< UNPLUG NETWORK CABLE FOR 10-15 sec HERE >
Traceback (most recent call last):
File "urllib3/response.py", line 438, in _error_catcher
yield
File "urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/usr/lib/python3.8/http/client.py", line 458, in read
n = self.readinto(b)
File "/usr/lib/python3.8/http/client.py", line 502, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "requests/models.py", line 753, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "urllib3/response.py", line 576, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "urllib3/response.py", line 443, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='example', port=443): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 14, in _download
response = session.get(url, headers=headers, timeout=300)
File "requests/sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "requests/sessions.py", line 697, in send
r.content
File "requests/models.py", line 831, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "requests/models.py", line 760, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='example', port=443): Read timed out.
I can sort of understand why this wouldn't work; things get even more obvious when you add the stream=True argument together with response.iter_content().
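For illustration, a minimal streaming variant might look something like this (the output file name and chunk size are just placeholders I picked); the same ConnectionError then surfaces from inside the iter_content() loop, after an arbitrary amount of data has already been written:

with session.get(url, stream=True, timeout=(120, 10)) as response:
    response.raise_for_status()
    with open("example.zip", "wb") as f:
        # A mid-stream network dip raises here, not at session.get()
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)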
I assume the rationale is that the read timeout and the TCP layer should handle this (in my example I set the read timeout deliberately low to provoke it). But we have cases where servers restart or crash, or where firewalls drop connections in the middle of a stream, and the only option for the client is to retry the entire download.
Is there any simple solution to this problem, ideally built into requests? One could always wrap the whole thing with tenacity or manual retries, but I'd like to avoid that, as it means adding another layer, having to distinguish network errors from other, real errors, and so on.
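For context, this is roughly the kind of extra layer I'm trying to avoid (the function name, the exception list and the backoff are just my guesses at what such a wrapper would need): it retries the whole GET from scratch and has to decide for itself which exceptions count as transient network errors.

import time
import requests

def download_with_retries(session, url, dest, attempts=5):
    for attempt in range(attempts):
        try:
            with session.get(url, stream=True, timeout=(120, 10)) as response:
                response.raise_for_status()
                with open(dest, "wb") as f:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        f.write(chunk)
            return
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ChunkedEncodingError,
                requests.exceptions.Timeout):
            # Guessing at which exceptions are "network" errors; anything else propagates.
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # crude backoff; tenacity would do this part for me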