
I am screenshotting several thousand web pages with pyppeteer. I discovered by accident that running the same script in two open terminals doubles my output. I tested this by opening up to six terminals and running the script, and I got up to six times the throughput.

I am considering using loop.run_in_executor to run the script in multiple processes or threads from a main program.

Is this the right call, or am I hitting some I/O or CPU limit in my script?

Here is how I'm thinking of doing it. I don't know if this is the right thing to do.

import asyncio
import concurrent.futures

async def blocking_io():
    # File operations (such as logging) can block the
    # event loop: run them in a thread pool.
    with open('/dev/urandom', 'rb') as f:
        return f.read(100)

async def cpu_bound():
    # CPU-bound operations will block the event loop:
    # in general it is preferable to run them in a
    # process pool.
    return sum(i * i for i in range(10 ** 7))

def wrap_blocking_io():
    return asyncio.run(blocking_io())

def wrap_cpu_bound():
    return asyncio.run(cpu_bound())

async def main():
    loop = asyncio.get_running_loop()
    # Options:
    # 1. Run in the default loop's executor:
    result = await loop.run_in_executor(
        None, wrap_blocking_io)
    print('default thread pool', result)
    # 2. Run in a custom thread pool:
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(
            pool, wrap_blocking_io)
        print('custom thread pool', result)
    # 3. Run in a custom process pool:
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(
            pool, wrap_cpu_bound)
        print('custom process pool', result)

if __name__ == '__main__':
    # Guard needed so the ProcessPoolExecutor can spawn worker
    # processes safely on Windows/macOS.
    asyncio.run(main())
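Applied to my screenshot job, I imagine it would look roughly like the sketch below. Here screenshot_urls, run_chunk and the shot_{i}.png naming are just illustrative stand-ins for my real code; the idea is a round-robin split of the URL list so each worker process drives its own browser and event loop.

import asyncio
import concurrent.futures
from pyppeteer import launch

async def screenshot_urls(numbered_urls):
    # numbered_urls: list of (index, url) pairs so filenames
    # stay unique across worker processes.
    browser = await launch()
    for i, url in numbered_urls:
        page = await browser.newPage()
        await page.goto(url)
        await page.screenshot(path=f'shot_{i}.png')
        await page.close()
    await browser.close()

def run_chunk(numbered_urls):
    # Each worker process runs its own event loop.
    return asyncio.run(screenshot_urls(numbered_urls))

async def run_all(all_urls, workers=6):
    loop = asyncio.get_running_loop()
    numbered = list(enumerate(all_urls))
    # Round-robin split: one chunk of URLs per worker process.
    chunks = [numbered[i::workers] for i in range(workers)]
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        await asyncio.gather(*(
            loop.run_in_executor(pool, run_chunk, chunk)
            for chunk in chunks
        ))

if __name__ == '__main__':
    asyncio.run(run_all(['https://example.com', 'https://example.org']))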
Jonathan
    there's nothing particularly bad in using `loop.run_in_executor` within your async code – RomanPerekhrest Jun 07 '19 at 09:12
  • Is it okay to use it to run an asynchronous function? I've shared an example of how I'd do this. I don't think there's anything blocking in my code. – Jonathan Jun 07 '19 at 09:14

1 Answer


I tested this by opening up to 6 terminals and running the script and I was able to get up to 6 times the performance.

Since pyppeteer is already asynchronous, I presume you simply aren't running multiple browsers (or pages) in parallel within a single script, and that's why you see more output when you run several copies of it.

To run several coroutines concurrently ("in parallel") you usually use something like asyncio.gather. Does your code use it? If not, check this example; this is how you should run multiple jobs:

responses = await asyncio.gather(*tasks)
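For example, a minimal sketch of screenshotting many URLs concurrently in a single process could look like this (take_screenshot, MAX_CONCURRENT and the file naming are illustrative; the semaphore just keeps the number of open pages bounded):

import asyncio
from pyppeteer import launch

MAX_CONCURRENT = 6  # illustrative limit on open pages; tune for your machine

async def take_screenshot(browser, sem, url, path):
    # The semaphore bounds how many pages are open at once
    # so the browser isn't overwhelmed.
    async with sem:
        page = await browser.newPage()
        try:
            await page.goto(url)
            await page.screenshot(path=path)
        finally:
            await page.close()

async def main(urls):
    browser = await launch()
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [
        take_screenshot(browser, sem, url, f'shot_{i}.png')
        for i, url in enumerate(urls)
    ]
    await asyncio.gather(*tasks)
    await browser.close()

asyncio.run(main(['https://example.com', 'https://example.org']))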

If you are already using asyncio.gather, consider providing a Minimal, Reproducible Example to make it easier to understand what is going on.

Mikhail Gerasimov