To scrape a pool of URLs, I am paralell processing selenium with joblib. In this context, I am facing two challenges:
- Challenge 1 is to speed up this process. In the moment, my code opens and closes a driver instance for every URL (ideally would be one for every process)
- Challenge 2 is to get rid of the CPU-intensive while loop that I think I need to
continue
on empty results (I know that this is most likely wrong)
Pseudocode:
URL_list = [URL1, URL2, URL3, ..., URL100000] # List of URLs to be scraped
def scrape(URL):
while True: # Loop needed to use continue
try: # Try scraping
driver = webdriver.Firefox(executable_path=path) # Set up driver
website = driver.get(URL) # Get URL
results = do_something(website) # Get results from URL content
driver.close() # Close worker
if len(results) == 0: # If do_something() failed:
continue # THEN Worker to skip URL
else: # If do_something() worked:
safe_results("results.csv") # THEN Save results
break # Go to next worker/URL
except Exception as e: # If something weird happens:
save_exception(URL, e) # THEN Save error message
break # Go to next worker/URL
Parallel(n_jobs = 40)(delayed(scrape)(URL) for URL in URL_list))) # Run in 40 processes
My understanding is that in order to re-use a driver instance across iterations, the # Set up driver
-line needs to be placed outside scrape(URL)
. However, everything outside scrape(URL)
will not find its way to joblib's Parallel(n_jobs = 40)
. This would imply that you can't reuse driver instances while scraping with joblib which can't be true.
Q1: How to reuse driver instances during parallel processing in the above example?
Q2: How to get rid of the while-loop while maintaining functionality in the above-mentioned example?
Note: Flash and image loading is disabled in firefox_profile
(code not shown)