
I use a ThreadPoolExecutor to run a function over a large number of endpoints. What I do not understand is why it gets slower over time: initially it processes 5,000-6,000 URLs per monitoring "refresh" interval, but that number keeps going down almost linearly. Where is the slowdown happening (the hosts are all comparable API endpoints with the same response time)?

It is obviously still much faster than a plain for loop; I am just very curious about the mechanics of it.

setting up and launching the pool

import time
import requests
from concurrent.futures import ThreadPoolExecutor

hosts = []  ## list of 1mil+ endpoints to update.
successful_hosts = []

def fn_to_run(host):
    r = requests.get(host.url + '/endpoint')
    if r.status_code == 200:
        successful_hosts.append(host)

pool = ThreadPoolExecutor(1500)  ## also tried very different numbers, from 20 to 10000
futures = [pool.submit(fn_to_run, host) for host in hosts]

monitoring

t_start = time.time()

while True:
    completed_futures = 0
    for future in futures:
        if future.done():
            completed_futures += 1
    t_now = time.time()

    print(f"""{completed_futures} hosts checked, {round(completed_futures/len(futures)*100,1)}% done in {round((t_now - t_start)/60,1)} minutes""")

    if completed_futures/len(futures) >= 0.999:
        print('finishing and shutting down pool')
        break
    time.sleep(15)  ## status refresh interval
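
For reference, concurrent.futures also provides as_completed, which yields each future as soon as it finishes. A minimal sketch (not part of the original post, reusing the futures list, t_start, and the 15-second interval from above) that counts completions incrementally instead of rescanning the whole list on every refresh:

import time
from concurrent.futures import as_completed

completed = 0
next_report = time.time() + 15  # same refresh interval as above

# as_completed yields each future exactly once, as it finishes,
# so already-counted futures are never re-scanned.
for future in as_completed(futures):
    completed += 1
    if time.time() >= next_report:
        print(f"{completed} hosts checked, "
              f"{round(completed / len(futures) * 100, 1)}% done in "
              f"{round((time.time() - t_start) / 60, 1)} minutes")
        next_report = time.time() + 15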

I tried

  • changing the size of the pool (from 20 to 20000).
  • timing the monitoring function (running in the main thread) to make sure it was not the iteration over the futures that was taking too much time (it's not; a timing/GC sketch follows the comments below).
  • Can you add the output (or part of it)? – Jeppe Apr 16 '23 at 10:07
  • As `successful_hosts` gets longer and longer, garbage collection is going to get harder and harder and take longer and longer. Is there something you can do with these guys besides just put them in a list? That very long list of futures is probably bogging you down, too. – Frank Yellin Apr 16 '23 at 18:53
  • @FrankYellin The host objects already exist, how does putting them into that list make garbage collection slower? – Kelly Bundy Apr 16 '23 at 19:22
  • @FrankYellin Also, I don't think garbage collection takes anywhere near that much time. – Kelly Bundy Apr 16 '23 at 19:28
  • thanks a lot for the answers. @Jeppe it really decreases linearly: the first run would get, let's say, 4000 requests, the next one 3500, the next 3000, and so on. It does bottom out at some point, but while it is still hugely efficient I am very curious about the mechanics of it (for my knowledge). I did suspect garbage collection (or the list of `successful_hosts` growing) but could not find a silver bullet. – SVK2022 Apr 16 '23 at 21:27
  • You could try profiling your script and see what takes the time. If you want answers, I suggest adding some output to back up your claim, e.g. that a sequential loop doesn't suffer from this while the threaded one does. – Jeppe Apr 17 '23 at 04:33
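
Following the profiling suggestion in the comments, here is a minimal sketch (a hypothetical helper, not from the original post) that times one pass over the futures list and prints the garbage collector's counters, to check both the "iteration over futures" and the garbage-collection hypotheses between refreshes:

import gc
import time

def report_scan_and_gc(futures, successful_hosts):
    """Time one scan of the futures list and show cumulative GC activity."""
    t0 = time.perf_counter()
    done = sum(f.done() for f in futures)
    scan_s = time.perf_counter() - t0

    # gc.get_stats() returns one dict per generation with cumulative
    # 'collections' and 'collected' counters.
    collections = sum(gen["collections"] for gen in gc.get_stats())

    print(f"scan of {len(futures)} futures took {scan_s:.3f}s, "
          f"{done} done, {len(successful_hosts)} successful, "
          f"{collections} GC collections so far")

Calling this once per refresh interval would show whether either number actually grows as the run slows down.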

1 Answer


Try something like this:

import requests
from multiprocessing.pool import ThreadPool  # ThreadPool lives in multiprocessing.pool

def fn_to_run(host):
    r = requests.get(host.url + '/endpoint')
    return host if r.status_code == 200 else None

with ThreadPool() as pool:
    for host in pool.imap(fn_to_run, hosts):
        if host is not None:
            ...  # do something with the successful host

If you don't care about the order of the results, pool.imap_unordered will be even better. The pool's internals are likely better at keeping track of the millions of tasks you're creating, and at quickly getting rid of each one as its thread finishes, than a hand-built list of a million futures.
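
For completeness, a sketch of the imap_unordered variant with the imports spelled out and a chunksize argument, which batches task dispatch and cuts per-task overhead when feeding a million-plus hosts (the pool size and chunksize here are illustrative values, not tuned numbers):

import requests
from multiprocessing.pool import ThreadPool

successful_hosts = []

with ThreadPool(100) as pool:
    # Results arrive in completion order, as soon as each worker finishes.
    for host in pool.imap_unordered(fn_to_run, hosts, chunksize=100):
        if host is not None:
            successful_hosts.append(host)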

Frank Yellin