
I have created a web-crawling solution using Python and Selenium, running inside a Docker container on an m4.2xlarge EC2 instance. It also uses multiprocessing with the Pool method:

with Pool(processes=config.no_of_cpus) as pool:
    pool.map(func, items)  # blocks until every URL has been processed
pool.join()
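
For context, each worker creates its own headless Chrome per URL, roughly like this (a simplified sketch, not the exact production code):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def func(url):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")             # commonly needed for Chrome inside containers
    options.add_argument("--disable-dev-shm-usage")  # use /tmp instead of the container's small /dev/shm
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even when the page load fails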

Whenever I run it with ~1000 URLs in the input, it fails with "failed to start a thread for the new session", "DevToolsActivePort file doesn't exist", or "InvalidSessionException" after successfully processing 300-400 URLs.

Below is my Dockerfile:

FROM amazonlinux
USER root
RUN yum -y update &&\
    yum install -y shadow-utils gcc tar curl gzip make zlib-devel mysql-devel python3-devel python3-setuptools python3-virtualenv python3
RUN /usr/sbin/useradd -m spider
COPY ./app/ /home/spider/app/
WORKDIR /home/spider/app/
RUN sh install-google-chrome.sh
ENV PYTHONPATH /app
ENV PATH $PATH:/home/spider
RUN python3 -m venv .
RUN chown -R spider /home/spider/app
USER spider
RUN yes | . ./bin/activate | pip3 install --user --upgrade pip cryptography requests
RUN yes | . ./bin/activate | pip3 install --user -r requirements.txt
RUN python3 download_chromedriver.py
ENTRYPOINT [ "/usr/bin/python3","batch.py" ]
CMD [ "default-arg" ]

The same app code works absolutely fine outside Docker. I am unable to understand what specific Docker settings I need to set here. I have already tried:

  • setting "default-shm-size": "5G" in /etc/docker/daemon.json (full file shown after this list)
  • reinstalling docker
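
For reference, this is the full /etc/docker/daemon.json I used (Docker was restarted after the change):

{
    "default-shm-size": "5G"
}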

I am using the code below to make sure that my chromedriver version always matches the installed google-chrome version:

"""
This script checks currently installed google-chrome and download corresponding chromedriver. Works with Windows and Posix.
."""
import os
import subprocess
import zipfile

import requests
import wget


CHROMEDRIVER_PATH = os.path.join(os.getcwd(), "Scripts", "") if os.name == "nt" else os.path.join(os.getcwd(), "bin", "")
CHROMEDRIVER_FOLDER = os.path.dirname(CHROMEDRIVER_PATH)
LATEST_DRIVER_URL = "https://chromedriver.storage.googleapis.com/LATEST_RELEASE"


def download_latest_version(version_number):
    print("Attempting to download latest driver online......")
    driver_file = "/chromedriver_win32.zip" if os.name == "nt" else "/chromedriver_linux64.zip"

    download_url = "https://chromedriver.storage.googleapis.com/" + version_number + driver_file

    # download the zip file containing the driver
    latest_driver_zip = wget.download(download_url, out=CHROMEDRIVER_FOLDER)
    # read & extract the zip file
    with zipfile.ZipFile(latest_driver_zip, 'r') as downloaded_zip:
        # You can choose the folder path to extract to below:
        downloaded_zip.extractall(path=CHROMEDRIVER_FOLDER)
    
    if os.name == "posix":
        subprocess.Popen("chmod +x " + CHROMEDRIVER_FOLDER + "/chromedriver", shell=True)

    # delete the zip file downloaded above
    os.remove(latest_driver_zip)


def check_driver():
    # run cmd line to check for existing web-driver version locally
    cmd_output = ""
    if os.name == "nt":
        command = 'wmic datafile where name="C:\\\\Program Files\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value'
        cmd_run = subprocess.run(command, capture_output=True, text=True)
        cmd_output = cmd_run.stdout.split("=")[1]
    else:
        p1 = subprocess.Popen(["google-chrome", "--version"], stdout=subprocess.PIPE)
        p2 = subprocess.Popen(["cut -d ' ' -f 3"], shell=True, stdin=p1.stdout, stdout=subprocess.PIPE)
        cmd_output = p2.communicate()[0].decode("utf-8").strip()

    # keep only major.minor.build; the LATEST_RELEASE endpoint expects the
    # Chrome version without the trailing patch number
    local_chrome_version = ".".join(cmd_output.split(".")[:-1])

    # check for the latest chromedriver version online for this Chrome build
    response = requests.get(LATEST_DRIVER_URL + "_" + local_chrome_version)
    online_driver_version = response.text.strip()

    # compare against the chromedriver that is already installed, if any
    driver_binary = os.path.join(CHROMEDRIVER_FOLDER, "chromedriver.exe" if os.name == "nt" else "chromedriver")
    local_driver_version = ""
    if os.path.exists(driver_binary):
        version_output = subprocess.run([driver_binary, "--version"], capture_output=True, text=True).stdout
        # output looks like "ChromeDriver 88.0.4324.96 (...)"; keep the version field
        local_driver_version = version_output.split()[1] if version_output else ""

    if local_driver_version == online_driver_version:
        return True
    download_latest_version(online_driver_version)

if __name__ == "__main__":
    check_driver()
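
In the app itself, Selenium is then pointed at the downloaded binary along these lines (a sketch; the path handling in the real code may differ):

from selenium import webdriver

# CHROMEDRIVER_PATH ends with a path separator, as defined in the script above
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH + "chromedriver")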

Please let me know if more information is required to help with debugging.


1 Answer


The crawl was getting stopped because Selenium generates too many zombie processes, and after a certain threshold (~27,500 on CentOS) the system stops letting the container spawn new ones. The way around it is to run a proper init process as PID 1 inside the container, with the help of the --init flag, like below:

docker run --init <image-name>

More details on --init are in the Docker run reference. Under the hood it uses tini to reap zombie processes.
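
A quick way to confirm the fix is to count defunct (zombie) processes inside the running container before and after adding --init (a diagnostic sketch in Python, assuming ps is available in the image):

import subprocess

# "ps -eo stat=" prints one process state per line with no header;
# zombie processes show up with a state starting with "Z"
stats = subprocess.run(["ps", "-eo", "stat="], capture_output=True, text=True).stdout
zombies = [line for line in stats.splitlines() if line.strip().startswith("Z")]
print(len(zombies), "zombie processes")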
