I am screenshotting several thousand web pages with pyppeteer. I discovered by accident that running the same script in two open terminals doubles my output. I tested this with up to six terminals running the script simultaneously, and throughput scaled roughly six-fold.
I am considering using loop.run_in_executor to run the script in multiple processes or threads from a main program. Is this the right call, or am I hitting some I/O or CPU limit in my script?
Here is how I'm thinking of doing it; I don't know if it's the right approach.
import asyncio
import concurrent.futures


async def blocking_io():
    # File operations (such as logging) can block the
    # event loop: run them in a thread pool.
    with open('/dev/urandom', 'rb') as f:
        return f.read(100)


async def cpu_bound():
    # CPU-bound operations will block the event loop:
    # in general it is preferable to run them in a
    # process pool.
    return sum(i * i for i in range(10 ** 7))


def wrap_blocking_io():
    # Plain function wrapper: each worker thread/process
    # starts its own event loop.
    return asyncio.run(blocking_io())


def wrap_cpu_bound():
    return asyncio.run(cpu_bound())


async def main():
    loop = asyncio.get_running_loop()

    # Options:

    # 1. Run in the default loop's executor:
    result = await loop.run_in_executor(None, wrap_blocking_io)
    print('default thread pool', result)

    # 2. Run in a custom thread pool:
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(pool, wrap_blocking_io)
        print('custom thread pool', result)

    # 3. Run in a custom process pool:
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(pool, wrap_cpu_bound)
        print('custom process pool', result)


# The __main__ guard is required for ProcessPoolExecutor on
# platforms that spawn worker processes (Windows, macOS).
if __name__ == '__main__':
    asyncio.run(main())
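Concretely, here is a rough sketch of how I imagine applying option 3 to my actual task. The screenshot_batch helper, the chunking, and the file-naming scheme are just placeholders I made up for illustration; the idea is that each worker process starts its own event loop and its own browser via asyncio.run:

import asyncio
import concurrent.futures
import os

from pyppeteer import launch


async def screenshot_batch(urls):
    # Each worker process runs its own event loop and browser.
    browser = await launch()
    page = await browser.newPage()
    for i, url in enumerate(urls):
        await page.goto(url)
        # Placeholder naming scheme; real code needs unique,
        # meaningful paths. The PID keeps workers from clobbering
        # each other's files.
        await page.screenshot({'path': f'shot_{os.getpid()}_{i}.png'})
    await browser.close()
    return len(urls)


def wrap_screenshot_batch(urls):
    # Plain module-level function so it can be pickled and sent
    # to a worker process; it starts a fresh event loop there.
    return asyncio.run(screenshot_batch(urls))


async def main(urls, workers=6):
    loop = asyncio.get_running_loop()
    # Split the URL list into one chunk per worker.
    chunks = [urls[i::workers] for i in range(workers)]
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        results = await asyncio.gather(*[
            loop.run_in_executor(pool, wrap_screenshot_batch, chunk)
            for chunk in chunks
        ])
    print('pages done:', sum(results))


if __name__ == '__main__':
    urls = ['http://example.com'] * 12  # placeholder list
    asyncio.run(main(urls, workers=6))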