I'm trying to scrape the contents of this site: http://www.intoaqua.com.au
I can't identify what causes pyppeteer to believe that the site has not finished loading/rendering. I suspect it's the Vimeo video animation, but I'm not sure. I also noticed that some scripts get called, but JS is not my forte and I don't understand what the scripts do or whether they interfere with pyppeteer.
I have a more comprehensive scraper script that I'm trying to improve; I started from scratch with the URL above to simplify debugging. I've outlined what I've tried so far below the first block of code.
Per the pyppeteer README, below is the most basic script to navigate to a webpage and take a screenshot. When I run it, I get a navigation TimeoutError after 30 seconds (the default timeout arg).
For reference, I'm running the following setup:
- google-chrome-stable/now 95.0.4638.69-1 amd64
- pyppeteer==1.0.2
I can successfully run the example provided in the README with the version of Chrome and pyppeteer listed above.
import asyncio
from pyppeteer import launch

HEADLESS_CHROME_PATH = '/tmp/headless-chromium'  # UPDATE to your headless-chrome path.
URL = 'http://www.intoaqua.com.au'

async def main():
    browser = await launch(
        executablePath=HEADLESS_CHROME_PATH,
    )
    page = await browser.newPage()
    await page.goto(URL)
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
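One workaround I'm experimenting with is to catch the navigation timeout and take the screenshot anyway, since whatever has rendered by then may still be usable even though the load event never fires. This is only a sketch (the function name and default path are mine):

```python
import asyncio

DEFAULT_NAV_TIMEOUT_MS = 30000  # pyppeteer's default page.goto timeout

async def screenshot_despite_timeout(url, path, timeout=DEFAULT_NAV_TIMEOUT_MS):
    # Imports kept local so the file can be inspected without pyppeteer installed.
    from pyppeteer import launch
    from pyppeteer.errors import TimeoutError as NavigationTimeout

    browser = await launch(executablePath='/tmp/headless-chromium')
    page = await browser.newPage()
    try:
        await page.goto(url, timeout=timeout)
    except NavigationTimeout:
        # The load event never fired, but whatever has rendered so far
        # can still be captured.
        pass
    await page.screenshot({'path': path})
    await browser.close()
```

Run it with `asyncio.get_event_loop().run_until_complete(screenshot_despite_timeout('http://www.intoaqua.com.au', 'example.png'))`.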
This is the last pyppeteer log that I receive before the timeout occurs:
timestamp - pyppeteer.connection.CDPSession - DEBUG - RECV:
{
  "method": "Network.requestWillBeSent",
  "params": {
    "requestId": "5FB7370AE30F687E8313AB0170BC100A",
    "loaderId": "5FB7370AE30F687E8313AB0170BC100A",
    "documentURL": "https://player.vimeo.com/video/591976007?loop=1&autoplay=1&title=0&byline=0&setVolume=0&api=1&player_id=1",
    "request": {
      "url": "https://player.vimeo.com/video/591976007?loop=1&autoplay=1&title=0&byline=0&setVolume=0&api=1&player_id=1",
      "method": "GET",
      "headers": {
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
        "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
        "Referer": "http://www.intoaqua.com.au/"
      },
      "mixedContentType": "none",
      "initialPriority": "VeryHigh",
      "referrerPolicy": "no-referrer-when-downgrade"
    },
    "timestamp": 23541.14503,
    "wallTime": 1647873017.37953,
    "initiator": {
      "type": "parser",
      "url": "http://www.intoaqua.com.au/",
      "lineNumber": 813
    },
    "type": "Document",
    "frameId": "5AFC116D4AA48CAF019AA88708B716C1",
    "hasUserGesture": false
  }
}
I've tried all of the page.goto waitUntil options:
await page.goto('http://www.intoaqua.com.au', waitUntil='load')
await page.goto('http://www.intoaqua.com.au', waitUntil='domcontentloaded')
await page.goto('http://www.intoaqua.com.au', waitUntil='networkidle0')
await page.goto('http://www.intoaqua.com.au', waitUntil='networkidle2')
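Since I suspect the looping Vimeo player is what keeps the network from ever going idle, I've also sketched a request-interception workaround that aborts anything from player.vimeo.com before retrying networkidle2. The helper name and blocklist are mine, and I haven't confirmed this is the actual culprit:

```python
import asyncio

BLOCKED_HOSTS = ('player.vimeo.com',)  # suspected never-idle resource

def should_block(url):
    """True for requests that I suspect keep the network busy forever."""
    return any(host in url for host in BLOCKED_HOSTS)

async def main():
    from pyppeteer import launch  # local import: sketch only

    browser = await launch(executablePath='/tmp/headless-chromium')
    page = await browser.newPage()
    await page.setRequestInterception(True)

    async def intercept(request):
        if should_block(request.url):
            await request.abort()
        else:
            await request.continue_()

    # pyppeteer event handlers are sync callbacks, so schedule the coroutine.
    page.on('request', lambda req: asyncio.ensure_future(intercept(req)))

    await page.goto('http://www.intoaqua.com.au', waitUntil='networkidle2')
    await page.screenshot({'path': 'example.png'})
    await browser.close()
```

Run it with `asyncio.get_event_loop().run_until_complete(main())`.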
I've tried specifying additional params for the browser launch:
browser = await launch(
    headless=True,
    executablePath=HEADLESS_CHROME_PATH,
    ignoreHTTPSErrors=True,
    userDataDir="/tmp/chrome/",
    args=[
        "--no-sandbox",
        "--disable-gpu",
        "--single-process",
        "--no-zygote",
        "--disable-infobars",
        "--disable-web-security",
        "--disable-webgl",
        "--disable-dev-shm-usage",
        "--ignore-certificate-errors",
        "--ignore-ssl-errors",
        "--incognito",
        "--no-referrers",
        "--no-proxy-server",
        "--stable-release-mode",
        "--enable-features=NetworkService",
    ],
    autoClose=False,
)
I've also tried injecting the headless-detection-evasion script shared in the Intoli post making-chrome-headless-undetectable (link removed to prevent the spam flag getting raised).
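For completeness, this is roughly how I'm injecting it. The one-liner below is just a stand-in for the full Intoli script and only masks navigator.webdriver; the function name is mine:

```python
import asyncio

# Stand-in for the full Intoli script: masks navigator.webdriver,
# the property most headless checks look at first.
STEALTH_JS = """
() => {
    Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
}
"""

async def open_with_stealth(url):
    from pyppeteer import launch  # local import: sketch only

    browser = await launch(executablePath='/tmp/headless-chromium')
    page = await browser.newPage()
    # Runs before any of the target page's own scripts on every navigation.
    await page.evaluateOnNewDocument(STEALTH_JS)
    await page.goto(url)
    return browser, page
```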