I use a ThreadPoolExecutor to run a function over a large number of endpoints. What I do not understand is why it gets slower over time - e.g. initially it processes 5000-6000 URLs per monitoring "refresh" interval, but that number keeps going down almost linearly. Where is the slowdown happening (the hosts are all comparable API endpoints with the same response time)?
It is obviously still much faster than a for loop; I am just very curious about the mechanics of it.
setting up and launching the pool
import time
import requests
from concurrent.futures import ThreadPoolExecutor

hosts = []  # list of 1mil+ endpoints to update
successful_hosts = []

def fn_to_run(host):
    r = requests.get(host.url + '/endpoint')
    if r.status_code == 200:
        successful_hosts.append(host)

pool = ThreadPoolExecutor(1500)  # also tried very different numbers from [20:10000]
futures = [pool.submit(fn_to_run, host) for host in hosts]
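As a side note on the shared `successful_hosts` list: `list.append` is thread-safe in CPython, but one way to avoid shared state entirely is to return the result from the worker and read it back off the futures. A minimal sketch, where `check` is a hypothetical stand-in for the real `requests.get` call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real HTTP check; True means "success".
def check(host):
    return host % 2 == 0  # placeholder predicate instead of requests.get

hosts = range(10)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Map each future back to the host it was submitted for.
    futures = {pool.submit(check, h): h for h in hosts}

# The with-block waits for all workers, so every result is ready here;
# successes are collected from the futures, no shared list needed.
successful = [h for f, h in futures.items() if f.result()]
print(successful)  # [0, 2, 4, 6, 8]
```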
monitoring
t_start = time.time()
while True:
    completed_futures = 0
    for future in futures:
        if future.done():
            completed_futures += 1
    t_now = time.time()
    print(f"{completed_futures} hosts checked, "
          f"{round(completed_futures / len(futures) * 100, 1)}% done "
          f"in {round((t_now - t_start) / 60, 1)} minutes")
    if completed_futures / len(futures) >= 0.999:
        print('finishing and shutting down pool')
        break
    time.sleep(15)  # status refresh interval
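For what it's worth, an alternative monitoring pattern that avoids rescanning the whole futures list each refresh is `concurrent.futures.as_completed`, which yields each future as it finishes, so progress can be counted incrementally. A minimal sketch, with a `time.sleep` stand-in for the HTTP call:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(i):
    time.sleep(0.001)  # stand-in for the real requests.get call
    return i

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(work, i) for i in range(100)]
    done = 0
    t_start = time.time()
    for _ in as_completed(futures):  # yields each future as it finishes
        done += 1
        if done % 50 == 0:           # periodic progress line
            print(f"{done}/{len(futures)} done "
                  f"in {round(time.time() - t_start, 1)}s")
```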
I have tried:
- changing the size of the pool (from 20 to 20000);
- timing the monitoring loop (running in the main thread) to make sure it is not the iteration over the futures that is taking too much time (it is not).
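One more check that might localize the slowdown is timing each call inside the worker itself and comparing early versus late averages: if the later half is slower, the per-request latency itself is growing rather than anything in the pool mechanics. A minimal sketch, with a sleep stand-in for `requests.get` (the `timed_call` helper is hypothetical):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

durations = []  # list.append is atomic in CPython, safe across threads

def timed_call(host):
    t0 = time.perf_counter()
    time.sleep(random.uniform(0, 0.002))  # stand-in for requests.get
    durations.append(time.perf_counter() - t0)

with ThreadPoolExecutor(max_workers=16) as pool:
    for h in range(200):
        pool.submit(timed_call, h)
# the with-block waits for all 200 tasks before continuing

# Compare the first half against the second half of the run.
half = len(durations) // 2
early = sum(durations[:half]) / half
late = sum(durations[half:]) / half
print(f"early avg {early:.4f}s, late avg {late:.4f}s")
```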