I have an asyncio-based crawler that occasionally offloads crawling that requires the browser to a ThreadPoolExecutor, as follows:

import concurrent.futures

from selenium import webdriver

def browserfetch(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # Some explicit wait stuff that can take up to 20 seconds.
    return browser.page_source

async def fetch(url, loop):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, browserfetch, url)
    return result

My issue is that I believe this respawns the headless browser each time I call fetch, which incurs browser startup time on each call to webdriver.Chrome. Is there a way for me to refactor browserfetch or fetch so that the same headless driver can be used on multiple fetch calls?

What have I tried?

I've considered more explicit use of threads/pools to start the Chrome instance in a separate thread/process, communicating within the fetch call via queues, pipes, etc (all run in Executors to keep the calls from blocking). I'm not sure how to make this work, though.

alex_noname
MikeRand

1 Answer

I believe that starting the browser in a separate process and communicating with it via queues is a good approach (and a more scalable one). The pseudo-code might look like this:

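# worker.py
# Crawler, Browser, stop and new_task are placeholders for your own code.
def entrypoint(in_queue, out_queue):  # runs in the child process
    crawler = Crawler()
    browser = Browser()  # started once, reused for every command
    while True:
        command = in_queue.get()
        if command is None:  # sentinel: stop the worker
            break
        result = crawler.process(command, browser)
        out_queue.put(result)

# main.py
from multiprocessing import Process, Queue

import worker

in_queue, out_queue = Queue(), Queue()
process = Process(target=worker.entrypoint, args=(in_queue, out_queue))
process.start()
while not stop:
    in_queue.put(new_task)
    result = out_queue.get()
in_queue.put(None)  # ask the worker to exit
process.join()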
alex_noname
  • `main.py` calls to `put` and `get`: should these be executed in a `pool` to avoid blocking (e.g. `await loop.run_in_executor(None, in_queue.put, new_task)` and `result = await loop.run_in_executor(None, out_queue.get)`), assuming the loop is sitting in a coroutine? – MikeRand Jul 02 '20 at 18:09
  • You can just call `put_nowait`/`get_nowait` repeatedly, or use a ready-made class like this one: https://stackoverflow.com/a/24704950/13782669 – alex_noname Jul 02 '20 at 18:19
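
To make the queue idea and the `run_in_executor` suggestion from the comments concrete, here is a minimal sketch. It rests on several assumptions: `FakeBrowser` is a hypothetical stand-in for `webdriver.Chrome()` so the example runs without Selenium, and the worker is a `threading.Thread` with `queue.Queue` objects rather than a separate process, purely to keep the example self-contained. `multiprocessing.Queue` offers the same blocking `put`/`get` calls, so the asyncio side should look much the same with a `Process`.

```python
import asyncio
import queue
import threading


class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome(); counts how often it is started."""

    instances = 0

    def __init__(self):
        FakeBrowser.instances += 1

    def fetch(self, url):
        return f"<html>{url}</html>"


def entrypoint(in_queue, out_queue):
    browser = FakeBrowser()  # started once, reused for every URL
    while True:
        url = in_queue.get()
        if url is None:  # sentinel: shut the worker down
            break
        out_queue.put(browser.fetch(url))


async def fetch(url, in_queue, out_queue):
    loop = asyncio.get_running_loop()
    # The blocking queue calls run in the default executor,
    # so the event loop stays free while the worker is busy.
    await loop.run_in_executor(None, in_queue.put, url)
    return await loop.run_in_executor(None, out_queue.get)


async def main(urls):
    in_queue, out_queue = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=entrypoint, args=(in_queue, out_queue))
    worker.start()
    try:
        # Sequential awaits keep each result paired with its request.
        return [await fetch(url, in_queue, out_queue) for url in urls]
    finally:
        in_queue.put(None)  # ask the worker to exit
        worker.join()


pages = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
```

Note that with a single shared `out_queue`, concurrent `fetch` calls could receive each other's results; tagging each request with an id, or using one worker per in-flight request, would be one way around that.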