
I am trying to scrape a bunch of URLs using Selenium and BeautifulSoup. Because there are thousands of them and the processing I need to do is complex and CPU-heavy, I need multiprocessing (as opposed to multithreading).

The problem right now is that I am opening and closing a Chromedriver instance once for each URL, which adds a lot of overhead and makes the process slow.

What I want instead is to have one chromedriver instance per subprocess, open it only once, and keep it open until the subprocess finishes. However, my attempts to do this have been unsuccessful.

I tried creating the instances in the main process, dividing the set of URLs among the processes, and sending each subprocess its subset of URLs and a single driver as arguments, so that each subprocess would cycle through the URLs it got. But that did not run at all; it gave neither results nor an error.

A solution similar to this one, but with multiprocessing instead of threading, got me a recursion-limit error (changing the recursion limit via sys did not help at all).

What else could I do to make this faster?

Below are the relevant parts of the code that actually works.

from bs4 import BeautifulSoup
import re
import csv
from datetime import datetime
import numpy as np
import concurrent.futures
import multiprocessing
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920x1080')
options.add_argument('--no-sandbox')

def runit(row):
    driver = webdriver.Chrome(options=options)
    driver.set_page_load_timeout(500)
    driver.implicitly_wait(500)
    url = row[1]
    driver.get(url)
    html_doc = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Some long processing code that uses the soup object and generates the result object that is returned below with what I want    

    return result, row

if __name__ == '__main__':
    multiprocessing.freeze_support()
    print(datetime.now())
    # The file below has the list of all the pages that I need to process, along with some other pieces of relevant data
    # The URL is the second field in the csv file
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        # I have 4 cores but Windows shows 8 logical processors, I have tried other numbers below 8, but 8 seems to bring the fastest results
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            results = executor.map(runit, csv_reader)

        #At a later time I will code here what I will do with the results after all the processes finish.

    print(datetime.now())
2 Answers


I found a possible solution to my own question.

The error I was making in my alternative attempts (not shown above) was that I was trying to create all the drivers in the main process and pass one to each subprocess as an argument. That did not work well. What I did instead was to create each chromedriver instance inside its subprocess, as you will see in my code below.

Please note, however, that this code is not entirely efficient. The rows are divided evenly by count among the subprocesses, but not all pages take the same time to process, so some subprocesses finish earlier than others and the CPU sits underutilized at the end. Even so, this takes 42% less time than opening and quitting a chromedriver instance for each URL. If anyone has a solution that achieves both things (efficient use of the CPU and each subprocess having its own chromedriver instance), I would be thankful; a rough sketch of one possibility follows the code below.

def runit(part):
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(500)
    driver.set_page_load_timeout(500)
    debug = False
    results = []
    keys = []
    #the subprocess now receives a bunch of rows instead of just one
    #so I have to cycle through them now
    for row in part:
        result = None
        try:
            #processFile is a function that does the processing of each URL
            result = processFile(row[1], debug, driver)
        except Exception as e:
            exc = str(e)
            print(f"EXCEPTION: {row[0]} caused {exc}")
        results.append(result)
        keys.append(row[0])
    driver.quit()
    return results, keys


if __name__ == '__main__':
    multiprocessing.freeze_support()
    maxprocessors = 8

    print(datetime.now())
    rows = []
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            rows.append(row)
    parts = []
    # I separate the rows into equal parts by count
    # However these parts are not equal in terms of required CPU time
    # Which creates CPU subutilization at the end
    for i in range(0, maxprocessors):
        parts.append(rows[i::maxprocessors])
    with concurrent.futures.ProcessPoolExecutor(max_workers=maxprocessors) as executor:
        results = executor.map(runit, parts)

    print(datetime.now())
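
For anyone who wants both properties at once, one possibility (a minimal, untested sketch, not the code I am actually running) is to use the initializer argument of ProcessPoolExecutor: each worker process builds its own chromedriver exactly once, and executor.map then hands out rows one at a time, so faster workers simply pull more rows and the CPU stays busy until the end. It assumes the same processFile helper and CSV file as above.

import csv
import concurrent.futures
from selenium import webdriver

driver = None  # one driver per worker process, created by init_worker

def init_worker():
    # Runs once in every worker process when the pool starts.
    global driver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--window-size=1920x1080')
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(options=options)
    driver.set_page_load_timeout(500)
    driver.implicitly_wait(500)

def runit(row):
    # processFile is the same helper used above; debug is assumed to be False.
    try:
        return row[0], processFile(row[1], False, driver)
    except Exception as e:
        print(f"EXCEPTION: {row[0]} caused {e}")
        return row[0], None

if __name__ == '__main__':
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        rows = list(csv.reader(csv_file, delimiter=','))
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=8, initializer=init_worker) as executor:
        # chunksize=1 keeps the distribution dynamic: a worker fetches a new
        # row as soon as it finishes the previous one.
        results = list(executor.map(runit, rows, chunksize=1))
    # Nothing calls driver.quit() in this sketch; depending on the platform,
    # Chrome/chromedriver processes may be left behind, so a real job may want
    # an explicit cleanup step (for example, a sentinel row that makes each
    # worker quit its driver).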

At the end of the day, you need more compute power to run these kinds of tests, i.e. multiple machines, BrowserStack, Sauce Labs, etc. Also look into Docker, where you can use a Selenium Grid implementation to run tests on more than one browser.

https://github.com/SeleniumHQ/docker-selenium
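
If you go that route, most of the scraping code above stays the same; the main change is creating the driver with webdriver.Remote pointed at the grid or container instead of a local chromedriver. A rough sketch follows; the URL and port are assumptions to adjust for your own setup, and it assumes a container from the linked repo (such as selenium/standalone-chrome) is already running.

from selenium import webdriver

# Assumes a Selenium container (e.g. selenium/standalone-chrome from the
# linked repo) is already running and reachable on localhost:4444.
options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    options=options,
)
driver.get('https://www.example.com')
print(driver.title)
driver.quit()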