I am trying to scrape a bunch of URLs using Selenium and BeautifulSoup. Because there are thousands of them and the processing I need to do is complex and CPU-heavy, I need multiprocessing (as opposed to multithreading).
The problem right now is that I am opening and closing a Chromedriver instance once for each URL, which adds a lot of overhead and makes the process slow.
What I want to do instead is have one Chromedriver instance per subprocess, open it only once, and keep it open until the subprocess finishes. However, my attempts to do this have been unsuccessful.
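Conceptually, what I am after is something like the sketch below (not working code, just the structure I have in mind; it assumes Python 3.7+, where ProcessPoolExecutor accepts an initializer, and worker_init / scrape_one are made-up names):

import concurrent.futures
from selenium import webdriver

driver = None  # one driver per worker process, created once by the initializer

def worker_init():
    global driver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=options)

def scrape_one(row):
    # reuse this worker's driver instead of opening a new one per URL
    driver.get(row[1])
    return driver.page_source, row

# in the main process:
# with concurrent.futures.ProcessPoolExecutor(max_workers=8, initializer=worker_init) as executor:
#     results = executor.map(scrape_one, rows)

Each worker's Chrome would stay open until the pool shuts down, which is what I am after.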
I tried creating the driver instances in the main process, dividing the set of URLs among the processes, and sending each subprocess its subset of URLs and a single driver as arguments, so that each subprocess would cycle through the URLs it got. But that did not run at all: it produced neither results nor an error.
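For reference, that attempt looked roughly like this (a reconstruction, not the exact code; process_chunk is an illustrative name and all_rows stands for the rows read from the csv):

import numpy as np
import concurrent.futures
from selenium import webdriver

def process_chunk(args):
    rows, driver = args  # the driver was created in the main process and passed in
    out = []
    for row in rows:
        driver.get(row[1])
        out.append((driver.page_source, row))
    return out

# in the main process:
# drivers = [webdriver.Chrome(chrome_options=options) for _ in range(8)]
# chunks = np.array_split(all_rows, 8)
# with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
#     results = executor.map(process_chunk, zip(chunks, drivers))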
A solution similar to this one, using multiprocessing instead of threading, got me a recursion-limit error (raising the recursion limit with sys.setrecursionlimit did not help at all).
What else could I do to make this faster?
Below are the relevant parts of the code that actually works.
from bs4 import BeautifulSoup
import re
import csv
from datetime import datetime
import numpy as np
import concurrent.futures
import multiprocessing
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920x1080')
options.add_argument('--no-sandbox')

def runit(row):
    driver = webdriver.Chrome(chrome_options=options)
    driver.set_page_load_timeout(500)
    driver.implicitly_wait(500)
    url = row[1]
    driver.get(url)
    html_doc = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Some long processing code that uses the soup object and generates the result object that is returned below with what I want
    return result, row

if __name__ == '__main__':
    multiprocessing.freeze_support()
    print(datetime.now())
    # The file below has the list of all the pages that I need to process, along with some other pieces of relevant data
    # The URL is the second field in the csv file
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        # I have 4 cores but Windows shows 8 logical processors; I have tried other numbers below 8, but 8 seems to bring the fastest results
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            results = executor.map(runit, csv_reader)
            # At a later time I will code here what I will do with the results after all the processes finish.
    print(datetime.now())