
I'm trying to scrape the contents of this site http://www.intoaqua.com.au and I can't identify what causes pyppeteer to believe that the site has not finished loading/rendering. I suspect it's the Vimeo video animation, but I'm not sure. I noticed that some scripts get called as well, but JS is not my forte and I don't understand what the scripts do or whether they are interfering with pyppeteer.

I have a more comprehensive scraper script that I'm trying to improve. I started from scratch for the URL below to simplify the debugging process. I've outlined what I've tried so far below the first block of code.

Per the pyppeteer README, below is the most basic script to navigate to a webpage and take a screenshot. When I run it, I get a navigation TimeoutError after 30 seconds (the default timeout).

For reference, I'm running the following setup:

  • google-chrome-stable/now 95.0.4638.69-1 amd64
  • pyppeteer==1.0.2
    I can successfully run the example provided in the README with the version of Chrome and pyppeteer listed above.
import asyncio
from pyppeteer import launch

HEADLESS_CHROME_PATH = '/tmp/headless-chromium'  # UPDATE to your headless-chrome path.
URL = 'http://www.intoaqua.com.au'

async def main():
    browser = await launch(
        executablePath=HEADLESS_CHROME_PATH,
    )
    page = await browser.newPage()
    await page.goto(URL)
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

This is the last pyppeteer log that I receive before the timeout occurs:
timestamp - pyppeteer.connection.CDPSession - DEBUG - RECV:

{
    "method": "Network.requestWillBeSent",
    "params": {
        "requestId": "5FB7370AE30F687E8313AB0170BC100A",
        "loaderId": "5FB7370AE30F687E8313AB0170BC100A",
        "documentURL": "https://player.vimeo.com/video/591976007?loop=1&autoplay=1&title=0&byline=0&setVolume=0&api=1&player_id=1",
        "request": {
            "url": "https://player.vimeo.com/video/591976007?loop=1&autoplay=1&title=0&byline=0&setVolume=0&api=1&player_id=1",
            "method": "GET",
            "headers": {
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
                "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
                "Referer": "http://www.intoaqua.com.au/"
            },
            "mixedContentType": "none",
            "initialPriority": "VeryHigh",
            "referrerPolicy": "no-referrer-when-downgrade"
        },
        "timestamp": 23541.14503,
        "wallTime": 1647873017.37953,
        "initiator": {
            "type": "parser",
            "url": "http://www.intoaqua.com.au/",
            "lineNumber": 813
        },
        "type": "Document",
        "frameId": "5AFC116D4AA48CAF019AA88708B716C1",
        "hasUserGesture": false
    }
}

I've tried all of the page.goto waitUntil options:

  • await page.goto('http://www.intoaqua.com.au', waitUntil='load')

  • await page.goto('http://www.intoaqua.com.au', waitUntil='domcontentloaded')

  • await page.goto('http://www.intoaqua.com.au', waitUntil='networkidle0')

  • await page.goto('http://www.intoaqua.com.au', waitUntil='networkidle2')

I've tried specifying additional params for the browser launch:

browser = await launch(
    headless=True,
    executablePath=HEADLESS_CHROME_PATH,
    ignoreHTTPSErrors=True,
    userDataDir="/tmp/chrome/",
    args=[
        "--no-sandbox",
        "--disable-gpu",
        "--single-process",
        "--no-zygote",
        "--disable-infobars",
        "--disable-web-security",
        "--disable-webgl",
        "--disable-dev-shm-usage",
        "--ignore-certificate-errors",
        "--ignore-ssl-errors",
        "--incognito",
        "--no-referrers",
        "--no-proxy-server",
        "--stable-release-mode",
        "--enable-features=NetworkService",
    ],
    autoClose=False,
)

I've also tried adding the headless-detection evasion script shared in the Intoli post "making-chrome-headless-undetectable" (link removed to avoid triggering the spam filter).

  • Question has too much code. Please trim your code to make it easier to find your problem. Follow these guidelines to create a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example – D.L Mar 21 '22 at 15:21
  • Thanks for the feedback. The first code block is all that’s needed to reproduce the problem and only includes the minimum code required to reproduce the problem. I followed the StackOverflow instructions to include steps I’ve taken to try to fix the problem. Perhaps this is distracting and better left out. – Christian Mar 21 '22 at 22:43
  • Have you tried without `executablePath` so pyppeteer downloads and uses the version of chromium it was designed for? – Jerther May 20 '22 at 13:20
  • Also, one trick that might help you debug this. It prints events, including networkIdle (networkidle0) and networkAlmostIdle (networkidle2): `page._client.on('Page.lifecycleEvent', lambda e: print(e))` – Jerther May 20 '22 at 14:45

0 Answers