I am starting asynchronous tasks using Python's concurrent.futures
ThreadPoolExecutor.
Following this approach, I monitor the progress of the asynchronous calls with a tqdm
progress bar.
My code looks like this:
import concurrent.futures
from tqdm import tqdm

with concurrent.futures.ThreadPoolExecutor(max_workers=n_jobs) as executor:
    future_to_url = {executor.submit(target_function, URL): URL for URL in URL_list}
    kwargs = {'total': len(future_to_url),  # For tqdm
              'unit': 'URL',                # For tqdm
              'unit_scale': True,           # For tqdm
              'leave': False,               # For tqdm
              'miniters': 50,               # For tqdm
              'desc': 'Scraping Progress'}
    for future in tqdm(concurrent.futures.as_completed(future_to_url), **kwargs):
        URL = future_to_url[future]
        try:
            data = future.result()  # Concurrent calls
        except Exception as exc:
            error_handling()        # Handle errors
        else:
            result_handling()       # Handle non-errors
The console output looks like this:
Scraping Progress: 9%|▉ | 3.35k/36.2k [08:18<1:21:22, 6.72URL/s] # I want < 6/s
Scraping Progress: 9%|▉ | 3.40k/36.2k [08:26<1:21:16, 6.72URL/s] # I want < 6/s
Scraping Progress: 10%|▉ | 3.45k/36.2k [08:30<1:20:40, 6.76URL/s] # I want < 6/s
Scraping Progress: 10%|▉ | 3.50k/36.2k [08:40<1:20:51, 6.73URL/s] # I want < 6/s
Scraping Progress: 10%|▉ | 3.55k/36.2k [08:46<1:20:36, 6.74URL/s] # I want < 6/s
Scraping Progress: 10%|▉ | 3.60k/36.2k [08:52<1:20:17, 6.76URL/s] # I want < 6/s
I know I can set up a URL queue and control its size, as described here.
However, I don't know how to control the throughput speed itself. Let's say I want no more than 6 URLs/s. Can this be achieved by anything other than adding time.sleep(n) to target_function()
in the above example?
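For reference, the queue-size idea boils down to keeping only a bounded number of futures outstanding at once. A minimal sketch (run_bounded, fn, and max_in_flight are hypothetical names, not part of my code) — note this caps concurrency, not URLs per second:

    import concurrent.futures
    import itertools

    def run_bounded(fn, items, max_in_flight=8, max_workers=4):
        # Keep at most max_in_flight futures outstanding; each time
        # some finish, submit that many new items from the iterator.
        results = []
        it = iter(items)
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
            futures = {ex.submit(fn, x) for x in itertools.islice(it, max_in_flight)}
            while futures:
                done, futures = concurrent.futures.wait(
                    futures, return_when=concurrent.futures.FIRST_COMPLETED)
                for f in done:
                    results.append(f.result())
                for x in itertools.islice(it, len(done)):
                    futures.add(ex.submit(f.__class__ and fn, x))
        return results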
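To be explicit about the workaround I want to avoid: a fixed sleep inside the worker (target_function here is a hypothetical stand-in for the real scraping work). It also throttles each worker independently, so with n_jobs workers the aggregate rate is roughly n_jobs times higher than the per-worker rate, which is another reason it is not a real throughput control:

    import time

    def target_function(url):
        time.sleep(1 / 6)  # the per-call delay I would rather avoid
        return url         # stand-in for the real scrape result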
How to effectively control the throughput speed of ThreadPoolExecutor in Python's concurrent.futures?