
I'm attempting to download around 3,000 files (each maybe 3 MB in size) from Amazon S3 using requests_futures, but the download slows down badly after about 900 files and actually starts to run slower than a basic for-loop.

It doesn't appear that I'm running out of memory or CPU bandwidth. It does, however, seem like the Wi-Fi connection on my machine slows to almost nothing: I drop from a few thousand packets/sec to just 3-4. The weirdest part is that I can't load any websites until the Python process exits and I restart my Wi-Fi adapter.

What in the world could be causing this, and how can I go about debugging it?

If it helps, here's my Python code:

import requests
from requests_futures.sessions import FuturesSession
from concurrent.futures import ThreadPoolExecutor, as_completed

# get a nice progress bar
from tqdm import tqdm

def download_threaded(urls, thread_pool, session):
    futures_session = FuturesSession(executor=thread_pool, session=session)
    futures_mapping = {}
    for i, url in enumerate(urls):
        future = futures_session.get(url)
        futures_mapping[future] = i
    
    results = [None] * len(futures_mapping)

    with tqdm(total=len(futures_mapping), desc="Downloading") as progress:
        for future in as_completed(futures_mapping):
            try:
                response = future.result()
                result = response.text
            except Exception as e:
                result = e
            i = futures_mapping[future]
            results[i] = result
            progress.update()

    return results

s3_paths = []  # some big list of file paths on Amazon S3
def make_s3_url(path):
    return "https://{}.s3.amazonaws.com/{}".format(BUCKET_NAME, path)

urls = map(make_s3_url, s3_paths)
with ThreadPoolExecutor() as thread_pool:
    with requests.Session() as session:
        results = download_threaded(urls, thread_pool, session)

Edit with various things I've tried:

  • time.sleep(0.25) after every future.result() (performance degrades sharply around 900)
  • 4 threads instead of the default 20 (performance degrades more gradually, but still degrades to basically nothing)
  • 1 thread (performance degrades sharply around 900, but recovers intermittently)
  • ProcessPoolExecutor instead of ThreadPoolExecutor (performance degrades sharply around 900)
  • calling raise_for_status() to throw an exception whenever the response status indicates an error (4xx/5xx), then catching this exception and printing it as a warning (no warnings appear) -- see the sketch after this list
  • using Ethernet instead of Wi-Fi, on a totally different network (no change)
  • creating futures in a normal requests session instead of using a FuturesSession (this is what I did originally, and found requests_futures while trying to fix the issue)
  • running the download on only a narrow range of files around the failure point (e.g. file 850 through file 950) -- performance is just fine there, print(response.status_code) shows 200 all the way, and no exceptions are caught.
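
The smaller-pool / raise_for_status / sleep variants were wired in roughly like this (a sketch reusing the names from the code above, not the exact script I ran):

import time
import requests
from requests_futures.sessions import FuturesSession
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = []  # same list of S3 URLs as above

with ThreadPoolExecutor(max_workers=4) as thread_pool:  # 4 workers instead of the default 20
    with requests.Session() as session:
        futures_session = FuturesSession(executor=thread_pool, session=session)
        futures = [futures_session.get(url) for url in urls]
        for future in as_completed(futures):
            try:
                response = future.result()
                response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
            except Exception as e:
                print("warning:", e)
            time.sleep(0.25)  # the crude client-side throttle mentioned above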

For what it's worth, I have previously been able to download ~1500 files from S3 in about 4 seconds using a similar method, albeit with files an order of magnitude smaller.

Things I will try when I have time today:

  • Using a for-loop (see the sketch after this list)
  • Using Curl in the shell
  • Using Curl + Parallel in the shell
  • Using urllib2
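
The plain for-loop baseline would be something like this (a sketch over the same URLs as above):

import requests

urls = []  # same list of S3 URLs as above

results = []
with requests.Session() as session:
    for url in urls:
        response = session.get(url)
        results.append(response.text)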

Edit: it looks like the number of threads is stable, but when the performance starts to go bad, the number of "Idle Wake Ups" reported by Activity Monitor appears to spike from a few hundred to a few thousand. What does that number mean, and can I use it to solve this problem?

Edit 2 from the future: I never ended up figuring out this problem. Instead of doing it all in one application, I just chunked the list of files and ran each chunk with a separate Python invocation in a separate terminal window. Ugly but effective! The cause of the problem will forever be a mystery, but I assume it was some kind of problem deep in the networking stack of my work machine at the time.
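
For the record, the workaround had roughly this shape (a sketch; the chunk size and the download_chunk.py script name are made up for illustration):

# Split the path list into chunks and write each chunk to its own file,
# then run the downloader on each chunk from a separate terminal window.
CHUNK_SIZE = 500  # placeholder value

s3_paths = []  # same big list of S3 paths as above
chunks = [s3_paths[i:i + CHUNK_SIZE] for i in range(0, len(s3_paths), CHUNK_SIZE)]
for n, chunk in enumerate(chunks):
    with open("chunk_{}.txt".format(n), "w") as f:
        f.write("\n".join(chunk))

# then, manually, one chunk per terminal window:
#   python download_chunk.py chunk_0.txt
#   python download_chunk.py chunk_1.txt
#   ...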

shadowtalker
  • It might be a bug in your Wi-Fi driver when you flood it with session open requests, and even if it isn't, creating 1000 threads doesn't seem like a good strategy. Why not try `with ThreadPoolExecutor(max_workers=n) as thread_pool:` and search for an `n` that doesn't cause problems? Note that "Changed in version 3.5: If `max_workers` is *None* or not given, it will default to the number of processors on the machine, multiplied by 5", [according to the docs](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) – Ken Y-N Oct 27 '16 at 00:12
  • I've updated my comment; however, your `FuturesSession()` call perhaps doubles the number of threads in use - try `n_cores * 2.5`? – Ken Y-N Oct 27 '16 at 00:17
  • @KenY-N I am indeed using 3.5 so this will have 20 workers. I will try with fewer. – shadowtalker Oct 27 '16 at 00:19
  • @KenY-N I got the idea to use FutureSession from that answer in the first place – shadowtalker Oct 27 '16 at 01:42
  • From where are the files being downloaded? Is it from one server, or a small set of servers? Could the downloads be rate limited because the server sees so many requests coming from your machine? – mhawke Oct 27 '16 at 02:37
  • @mhawke it's from S3. I asked one of our engineers and he said that shouldn't be the problem. – shadowtalker Oct 27 '16 at 03:27
  • @ssdecontrol: perhaps not the problem given the sizes involved, but here is a reference that indicates that rate monitoring/limiting might occur: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html – mhawke Oct 27 '16 at 03:46
  • @mhawke I added a `time.sleep(0.25)` to each request and it still hits a wall around 900, so I doubt it's being rate-limited (that document talks about 100-800 requests _per second_ which I certainly am not making here) – shadowtalker Oct 27 '16 at 05:20
  • @KenY-N I tried it on Ethernet on a different network (home network) and still hit a wall at the same point. – shadowtalker Oct 27 '16 at 05:20
  • I tried it on my PC with Python 2.7 and `urls = ['https://s3.amazonaws.com'] * 3000` and `ThreadPoolExecutor(40)` (8 cores) and it runs steadily to completion in about a minute. This is on Linux; what OS do you have? Have you tried `print(e)` in your exception handler just in case you are missing an error? – Ken Y-N Oct 27 '16 at 09:26

1 Answer


This isn't a surprise.

You don't get any parallelism when you have more threads than cores.

You can prove this to yourself by simplifying the problem to a single core with multiple threads.

What happens? You can only have one thread running at a time, so the operating system context switches each thread to give everyone a turn. One thread works, the others sleep until they are woken up in turn to do their bit. In that case you can't do better than a single thread.

You may do worse, because context switching and the memory allocated for each thread (about 1 MB each) have a price, too.

Read up on Amdahl's Law.
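
For illustration, Amdahl's Law caps the speedup at 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of workers (the 0.95 below is just an example figure, not measured from this workload):

def amdahl_speedup(p, n):
    """Upper bound on speedup with parallel fraction p and n workers."""
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.95, 4))   # ~3.48
print(amdahl_speedup(0.95, 20))  # ~10.26 -- each extra thread buys less and less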

duffymo
  • If the slowness is due to context-switching overhead, wouldn't it _start_ slow and stay that way? This runs blazing fast at first. Also isn't allowing the CPU to switch tasks the whole point of multithreading an I/O bound process? Or is the CPU actively required to process HTTP requests? – shadowtalker Oct 27 '16 at 00:17
  • It'll get worse with more threads, because there's more context switching. – duffymo Oct 27 '16 at 00:23
  • sure, but the number of threads shouldn't increase abruptly 1/3 of the way through – shadowtalker Oct 27 '16 at 01:43
  • I don't _believe_ I'm creating extra threads. As far as I understand the functions I used, 20 threads (4 cores x 5, the default) are created up-front and then re-used by the ThreadPoolExecutor as downloads complete. Hence my question; maybe I'm doing something wrong in the code that I don't realize is wrong. – shadowtalker Oct 27 '16 at 03:28
  • I tried it with 4 threads instead of 20; performance still drops badly around the 900 mark, but more gradually – shadowtalker Oct 27 '16 at 05:27
  • I also tried it with `ProcessPoolExecutor` instead of `ThreadPoolExecutor` -- same slowdown (around 900 downloads) – shadowtalker Oct 27 '16 at 05:37
  • Profile it - you are guessing without data. – duffymo Oct 27 '16 at 09:12
  • In this case the code is IO bound rather than processor bound, so there is a benefit from creating more threads than processors/cores. I was actually able to go up to 400 threads without a loss of performance. – Ken Y-N Oct 27 '16 at 09:43
  • What kind of profiling do you suggest? Memory usage? Function calls? I have been watching the Activity Monitor, so at the very least I can see how CPU/memory/network usage changes – shadowtalker Oct 27 '16 at 11:32
  • Something like VisualVM or dynaTrace. – duffymo Oct 27 '16 at 11:39