I'm attempting to download around 3,000 files (each maybe 3 MB in size) from Amazon S3 using requests_futures, but the download slows down badly after about 900 files and actually starts to run slower than a basic for-loop.
It doesn't appear that I'm running out of memory or CPU. It does, however, seem like the Wi-Fi connection on my machine slows to almost nothing: I drop from a few thousand packets/sec to just 3-4. The weirdest part is that I can't load any websites until the Python process exits and I restart my Wi-Fi adapter.
What in the world could be causing this, and how can I go about debugging it?
If it helps, here's my Python code:
import requests
from requests_futures.sessions import FuturesSession
from concurrent.futures import ThreadPoolExecutor, as_completed
# get a nice progress bar
from tqdm import tqdm

def download_threaded(urls, thread_pool, session):
    futures_session = FuturesSession(executor=thread_pool, session=session)

    futures_mapping = {}
    for i, url in enumerate(urls):
        future = futures_session.get(url)
        futures_mapping[future] = i

    results = [None] * len(futures_mapping)
    with tqdm(total=len(futures_mapping), desc="Downloading") as progress:
        for future in as_completed(futures_mapping):
            try:
                response = future.result()
                result = response.text
            except Exception as e:
                result = e
            i = futures_mapping[future]
            results[i] = result
            progress.update()

    return results

s3_paths = []  # some big list of file paths on Amazon S3

def make_s3_url(path):
    return "https://{}.s3.amazonaws.com/{}".format(BUCKET_NAME, path)

urls = map(make_s3_url, s3_paths)

with ThreadPoolExecutor() as thread_pool:
    with requests.session() as session:
        results = download_threaded(urls, thread_pool, session)
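One debugging idea: watch this process's open TCP sockets while the download runs, to see whether connections pile up in some state (e.g. TIME_WAIT or CLOSE_WAIT) right around file 900. A rough sketch using psutil (psutil isn't part of my code above, just something I'd add for monitoring):

from collections import Counter

import psutil

proc = psutil.Process()  # this Python process

def log_socket_states():
    # Tally this process's TCP connections by state (ESTABLISHED, TIME_WAIT, ...)
    states = Counter(conn.status for conn in proc.connections(kind="tcp"))
    print(dict(states))

# call log_socket_states() every few seconds (e.g. from a separate thread)
# while download_threaded is running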
Edit with various things I've tried:

- time.sleep(0.25) after every future.result() (performance degrades sharply around 900)
- 4 threads instead of the default 20 (performance degrades more gradually, but still degrades to basically nothing)
- 1 thread (performance degrades sharply around 900, but recovers intermittently)
- ProcessPoolExecutor instead of ThreadPoolExecutor (performance degrades sharply around 900)
- calling raise_for_status() to raise an exception on any error status (4xx/5xx), then catching it and printing it as a warning (no warnings appear; see the sketch after this list)
- using Ethernet instead of Wi-Fi, on a totally different network (no change)
- creating futures in a normal requests session instead of using a FuturesSession (this is what I did originally, and found requests_futures while trying to fix the issue)
- running the download on only a narrow range of files around the failure point (e.g. file 850 through file 950) -- performance is just fine here, print(response.status_code) shows 200 all the way, and no exceptions are caught
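Here's roughly what the raise_for_status variant of the result loop looked like (same loop as in download_threaded above, just with the status check added; reconstructed from memory, so details may be slightly off):

import warnings

# inside download_threaded, replacing the original result loop:
for future in as_completed(futures_mapping):
    try:
        response = future.result()
        response.raise_for_status()  # raises requests.HTTPError on any 4xx/5xx
        result = response.text
    except Exception as e:
        warnings.warn(str(e))  # this never fires
        result = e
    i = futures_mapping[future]
    results[i] = result
    progress.update()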
For what it's worth, I have previously been able to download ~1500 files from S3 in about 4 seconds using a similar method, albeit with files an order of magnitude smaller.
Things I will try when I have time today:
- Using a for-loop (see the sketch after this list)
- Using Curl in the shell
- Using Curl + Parallel in the shell
- Using urllib2
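For the plain for-loop, I have something like this in mind (just requests in a serial loop, reusing make_s3_url from above; a sketch I haven't run yet):

import requests
from tqdm import tqdm

def download_serial(urls):
    # Same work as download_threaded, but one request at a time.
    results = []
    with requests.Session() as session:
        for url in tqdm(urls, desc="Downloading"):
            response = session.get(url)
            results.append(response.text)
    return results

# results = download_serial(make_s3_url(p) for p in s3_paths)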
Edit: it looks like the number of threads is stable, but when the performance starts to go bad the number of "Idle Wake Ups" appears to spike from a few hundred to a few thousand. What does that number mean, and can I use it to solve this problem?
Edit 2 from the future: I never ended up figuring out this problem. Instead of doing it all in one application, I just chunked the list of files and ran each chunk with a separate Python invocation in a separate terminal window. Ugly but effective! The cause of the problem will forever be a mystery, but I assume it was some kind of problem deep in the networking stack of my work machine at the time.
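For reference, each chunk ran with something like the following, one terminal window per (start, end) range; the command-line arguments and chunk size here are illustrative, not the exact values I used:

# e.g. python download_chunk.py 0 500, python download_chunk.py 500 1000, ...
import sys

start, end = int(sys.argv[1]), int(sys.argv[2])
chunk_urls = [make_s3_url(p) for p in s3_paths[start:end]]

with ThreadPoolExecutor() as thread_pool:
    with requests.session() as session:
        results = download_threaded(chunk_urls, thread_pool, session)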