I wrote a scraping function for my Django web app, which is hosted on Heroku. The scraping function runs as a Celery task and uses undetected_chromedriver.

The main issue seems to be running the driver from multithreaded code. For example, say I have 2 URLs to scrape. The scraping is set up like this:

    results = []
    # Create a ThreadPoolExecutor with 10 workers
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Assign tasks to each worker using the executor.submit method
        futures = [executor.submit(scrape_website, website, info, product_name) for website, info in websites.items()]

        # Wait for all workers to complete their tasks and retrieve their results
        for future in as_completed(futures):
            results.extend(future.result())

    return results

The drivers are created this way:

def create_selenium_instance():
    
    options = uc.ChromeOptions()
    options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    options.add_argument('--headless=new')
    options.add_argument('--no-sandbox')   
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-background-networking")
    options.add_argument("--disable-background-timer-throttling")
    options.add_argument("--disable-renderer-backgrounding")
    options.add_argument("--disable-sync")
    options.add_argument("--metrics-recording-only")
    options.add_argument("--disable-default-apps")
    options.add_argument("--mute-audio")
    options.add_argument("--no-first-run")
    options.add_argument("--disable-breakpad")
    chromedriver_path = os.environ.get("CHROMEDRIVER_PATH")
    driver = uc.Chrome(executable_path=chromedriver_path, options=options)
    return driver

The code opens the first URL and retrieves the correct information (it gets printed in the logs), but as soon as it tries to open the second URL I get this error:

2023-04-22T15:27:32.678552+00:00 app[worker.1]: Traceback (most recent call last):
2023-04-22T15:27:32.678552+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
2023-04-22T15:27:32.678552+00:00 app[worker.1]:     R = retval = fun(*args, **kwargs)
2023-04-22T15:27:32.678553+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
2023-04-22T15:27:32.678553+00:00 app[worker.1]:     return self.run(*args, **kwargs)
2023-04-22T15:27:32.678553+00:00 app[worker.1]:   File "/app/myapp/tasks.py", line 510, in search_product
2023-04-22T15:27:32.678554+00:00 app[worker.1]:     results.extend(future.result())
2023-04-22T15:27:32.678555+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/concurrent/futures/_base.py", line 451, in result
2023-04-22T15:27:32.678556+00:00 app[worker.1]:     return self.__get_result()
2023-04-22T15:27:32.678556+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2023-04-22T15:27:32.678556+00:00 app[worker.1]:     raise self._exception
2023-04-22T15:27:32.678556+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2023-04-22T15:27:32.678556+00:00 app[worker.1]:     result = self.fn(*self.args, **self.kwargs)
2023-04-22T15:27:32.678557+00:00 app[worker.1]:   File "/app/myapp/tasks.py", line 397, in scrape_website
2023-04-22T15:27:32.678557+00:00 app[worker.1]:     driver = create_selenium_instance()
2023-04-22T15:27:32.678557+00:00 app[worker.1]:   File "/app/myapp/tasks.py", line 60, in create_selenium_instance
2023-04-22T15:27:32.678557+00:00 app[worker.1]:     driver = uc.Chrome(executable_path="/app/.chromedriver/bin/chromedriver", options=options)
2023-04-22T15:27:32.678558+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 246, in __init__
2023-04-22T15:27:32.678558+00:00 app[worker.1]:     self.patcher.auto()
2023-04-22T15:27:32.678559+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 127, in auto
2023-04-22T15:27:32.678559+00:00 app[worker.1]:     self.unzip_package(self.fetch_package())
2023-04-22T15:27:32.678559+00:00 app[worker.1]:   File "/app/.heroku/python/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 180, in unzip_package
2023-04-22T15:27:32.678564+00:00 app[worker.1]:     os.rename(os.path.join(self.zip_path, self.exe_name), self.executable_path)
2023-04-22T15:27:32.678566+00:00 app[worker.1]: FileNotFoundError: [Errno 2] No such file or directory: '/app/.local/share/undetected_chromedriver/undetected/chromedriver' -> '/app/.local/share/undetected_chromedriver/undetected_chromedriver'

I tried to work around the error with an if statement that fixes the path, but I still get the same error:

def create_selenium_instance():
    
    options = uc.ChromeOptions()
    options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    options.add_argument('--headless=new')
    options.add_argument('--no-sandbox')   
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-background-networking")
    options.add_argument("--disable-background-timer-throttling")
    options.add_argument("--disable-renderer-backgrounding")
    options.add_argument("--disable-sync")
    options.add_argument("--metrics-recording-only")
    options.add_argument("--disable-default-apps")
    options.add_argument("--mute-audio")
    options.add_argument("--no-first-run")
    options.add_argument("--disable-breakpad")
    chromedriver_path = os.environ.get("CHROMEDRIVER_PATH")
    renamed_path = "/app/.local/share/undetected_chromedriver/undetected_chromedriver"
    
    if not os.path.exists(chromedriver_path) and os.path.exists(renamed_path):
        os.environ["CHROMEDRIVER_PATH"] = renamed_path
    driver = uc.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), options=options)
    return driver
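
One direction that might be worth testing, assuming the failure comes from two threads setting up the driver at the same moment (an assumption based on the traceback ending in patcher.unzip_package and os.rename, not something confirmed): serialize driver creation behind a lock while leaving the scraping itself parallel. A minimal sketch follows. The names create_driver_serialized, scrape_all, factory and scrape_one are hypothetical; factory stands in for create_selenium_instance, and scrape_one for the per-site scraping logic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

# Shared by all worker threads: only one thread may build a driver at a time.
_driver_creation_lock = threading.Lock()


def create_driver_serialized(factory):
    """Call factory() while holding the lock, so driver setup never overlaps."""
    with _driver_creation_lock:
        return factory()


def scrape_all(urls, factory, scrape_one, max_workers=10):
    """Scrape each URL in its own thread; only driver creation is serialized.

    factory    -- hypothetical callable returning a new driver
    scrape_one -- hypothetical callable (driver, url) -> list of results
    """

    def worker(url):
        driver = create_driver_serialized(factory)
        try:
            return scrape_one(driver, url)
        finally:
            # Selenium drivers expose quit(); a test stand-in may not.
            quit_driver = getattr(driver, "quit", None)
            if callable(quit_driver):
                quit_driver()

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, url) for url in urls]
        for future in as_completed(futures):
            results.extend(future.result())
    return results
```

With this shape, the call in the task would look something like scrape_all(urls, create_selenium_instance, scrape_one). The lock only helps if the problem really is concurrent setup; if the driver files are missing for another reason, the same error would still appear.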

Does anyone know if it's possible to fix this?

Thank you!

Karuizawa