I have an asyncio-based crawler that occasionally offloads crawling that requires a browser to a ThreadPoolExecutor, as follows:
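import concurrent.futures
from selenium import webdriver

def browserfetch(url):
    # A new Chrome instance is started on every call.
    browser = webdriver.Chrome()
    browser.get(url)
    # Some explicit wait stuff that can take up to 20 seconds.
    return browser.page_source

async def fetch(url, loop):
    # A new pool is created (and torn down) on every call as well.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, browserfetch, url)
    return result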
My issue is that I believe this respawns the headless browser each time I call fetch, which incurs browser startup time on each call to webdriver.Chrome. Is there a way for me to refactor browserfetch or fetch so that the same headless driver can be used on multiple fetch calls?
What have I tried?
I've considered more explicit use of threads/pools to start the Chrome instance in a separate thread/process, communicating within the fetch call via queues, pipes, etc. (all run in Executors to keep the calls from blocking). I'm not sure how to make this work, though.
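To make that idea concrete, here is a rough sketch of the direction I have in mind. It is only an illustration under assumptions: FakeDriver is a hypothetical stand-in for webdriver.Chrome so the snippet runs without Selenium, and the names _local, _pool, and _start_driver are mine. The pool is created once with a single worker, and the pool's initializer starts the driver in that worker thread, so every fetch call should reuse the same instance. I have not verified this against a real browser.

```python
import asyncio
import concurrent.futures
import threading


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome (assumption, not Selenium)."""

    instances = 0  # counts how many "browsers" were started

    def __init__(self):
        FakeDriver.instances += 1
        self.page_source = ""

    def get(self, url):
        self.page_source = f"<html>{url}</html>"


_local = threading.local()


def _start_driver():
    # Runs once in the worker thread when the pool starts that thread.
    _local.driver = FakeDriver()


# One long-lived pool with a single worker, so only one driver is created
# and only one thread ever touches it.
_pool = concurrent.futures.ThreadPoolExecutor(
    max_workers=1, initializer=_start_driver
)


def browserfetch(url):
    driver = _local.driver  # reuse the driver created by the initializer
    driver.get(url)
    return driver.page_source


async def fetch(url):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, browserfetch, url)


async def main():
    return await asyncio.gather(fetch("a"), fetch("b"), fetch("c"))


pages = asyncio.run(main())
```

With a single worker, the browser calls are serialized, so I would lose parallelism between browser fetches. A real driver would presumably also need an explicit quit() when the crawler shuts down, which this sketch does not handle.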