I have a list of article titles and IDs that I use to build the article URLs and scrape their contents. I'm using multiprocessing.Pool to parallelize the work. Here's my code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from article import Article
from signal import signal, SIGTERM
import multiprocessing as mp
import sys

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.binary_location = '*path*\chrome.exe'    
driver = webdriver.Chrome(executable_path="chromedriver", chrome_options=chrome_options)


def get_article(args):
    title, id, q = args
    article = Article.from_url('https://*url*/article/{}'.format(id), driver, title=title, id=id)
    print('parsed article: ', title)
    q.put(article.to_json())


def file_writer(q):
    with open('data/articles.json', 'w+') as file:
        while True:
            line = q.get()
            if line == 'END':
                break
            file.write(line + '\n')
            file.flush()


if __name__ == '__main__':
    manager = mp.Manager()
    queue = manager.Queue()
    pool_size = mp.cpu_count() - 2
    pool = mp.Pool(pool_size)
    writer = mp.Process(target=file_writer, args=(queue,))
    writer.start()

    with open('data/article_list.csv', 'r') as article_list:
        article_list_with_queue = [(*line.split('|'), queue) for line in article_list]
        pool.map(get_article, article_list_with_queue)

    queue.put('END')

    pool.close()
    pool.join()

    driver.close()

The code runs fine, but after it finishes, about 80 child processes remain under PyCharm.exe. Most are chrome.exe; a few are chromedriver.exe.

I tried to put

signal(SIGTERM, terminate)

in the worker function and quit the drivers in terminate(), but that doesn't work.

Petar Chernev

1 Answer

You can create a .bat file that kills all leftover chromedriver.exe processes:

@echo off
rem   just kills stray local chromedriver.exe instances.
rem   useful if you are trying to clean your project, and your ide is complaining.

taskkill /im chromedriver.exe /f

Run it after all tests have finished.
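
To avoid closing unrelated Chrome windows, a narrower variant is to kill one chromedriver.exe together with the processes it started, instead of killing by image name. The following is a minimal Python sketch, assuming Windows and that the chromedriver PID is already known. With Selenium it is usually reachable as driver.service.process.pid, but treat that attribute as an assumption about your Selenium version. The helper names are hypothetical. The /T switch tells taskkill to end the given process and its child processes, and /F forces termination.

```python
import subprocess


def taskkill_tree_command(pid):
    """Build a taskkill command that ends `pid` and its child processes."""
    # /PID selects one process, /T includes its children, /F forces termination.
    return ["taskkill", "/PID", str(pid), "/T", "/F"]


def kill_process_tree(pid):
    """Run taskkill for `pid` (Windows only) and return its exit code."""
    result = subprocess.run(
        taskkill_tree_command(pid),
        capture_output=True,  # keep taskkill's messages out of the console
        text=True,
    )
    return result.returncode
```

Calling kill_process_tree() once per leftover chromedriver PID should leave a manually opened Chrome browser alone, because that browser is not a child of chromedriver.exe.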

  • Yep, that's what I'm doing now (typing it in console instead of .bat), but that only kills the 7-8 chromedriver.exe processes. There are 70 more chrome.exe instances. Killing them also closes my own Chrome browser. I guess a follow-up question is how to taskkill only children of a process. Anyway, it's not that big a deal, but I would just like to be able to do it programmatically. It seems like the multiprocessing module/Selenium should have this sort of functionality. – Petar Chernev Jul 10 '18 at 10:21
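
One way to do the cleanup programmatically is to give every pool worker its own driver and quit it when the worker exits. The sketch below is an assumption-laden example, not a confirmed fix for the code in the question: make_driver, init_worker, quit_driver, fetch_title and scrape_all are hypothetical names, and Finalize comes from multiprocessing.util, which is only lightly documented. It relies on two facts: driver.quit() ends the whole session (browser and chromedriver), whereas driver.close() only closes the current window, and finalizers run only when workers exit normally after pool.close() and pool.join().

```python
import multiprocessing as mp
from multiprocessing.util import Finalize

_driver = None  # one WebDriver per worker process, set by init_worker()


def make_driver():
    """Hypothetical factory: build a headless Chrome driver."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    # Assumes Selenium can locate chromedriver (for example on PATH).
    return webdriver.Chrome(options=options)


def quit_driver():
    """Shut down this process's driver, if it has one."""
    global _driver
    if _driver is not None:
        _driver.quit()  # ends the session: browser and chromedriver
        _driver = None


def init_worker(factory=make_driver):
    """Pool initializer: runs once in every worker process."""
    global _driver
    _driver = factory()
    # Ask multiprocessing to call quit_driver() when this worker exits normally.
    Finalize(None, quit_driver, exitpriority=10)


def fetch_title(url):
    """Example task: load a page with the worker's own driver."""
    _driver.get(url)
    return _driver.title


def scrape_all(urls, processes=4):
    """Map fetch_title over urls, then let the workers shut down cleanly."""
    pool = mp.Pool(processes, initializer=init_worker)
    try:
        return pool.map(fetch_title, urls)
    finally:
        # close() + join() lets workers exit normally so the finalizers run;
        # terminate() would kill them before any cleanup happens.
        pool.close()
        pool.join()
```

On Windows, call scrape_all() only under the if __name__ == '__main__': guard and keep driver creation out of module level. Child processes are started with spawn there and re-import the main module, so a module-level webdriver.Chrome(...) call runs again in every child, which is a likely source of the extra chrome.exe processes.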