
When I run the script in headless mode, it simply times out on page.goto(url). When I run it with headless: false and let it do its thing, the URL starts to load for a moment, then goes into a sort of redirect and loads endlessly.

However, if, while running with headless: false, I open a new tab and manually navigate to the URL, the original tab loads fine. I'm already taking a lot of steps to avoid detection here:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const userAgent = require('user-agents');

puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: false });

const page = await browser.newPage();

await page.setViewport({
    width: 1200,
    height: 800,
    deviceScaleFactor: 1,
    hasTouch: false,
    isLandscape: true,
    isMobile: false,
});

const agent = userAgent.random();
await page.setUserAgent(agent.toString());
await page.setJavaScriptEnabled(true);

// Pass the Webdriver Test.
await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
        get: () => false,
    });
});

// Pass the Chrome Test.
await page.evaluateOnNewDocument(() => {
    // We can mock this in as much depth as we need for the test.
    window.navigator.chrome = {
        runtime: {},
        // etc.
    };
});

// Pass the Permissions Test.
await page.evaluateOnNewDocument(() => {
    const originalQuery = window.navigator.permissions.query;
    return window.navigator.permissions.query = (parameters) => (
        parameters.name === 'notifications' ?
            Promise.resolve({ state: Notification.permission }) :
            originalQuery(parameters)
    );
});

// Pass the Plugins Length Test.
await page.evaluateOnNewDocument(() => {
    // Overwrite the `plugins` property to use a custom getter.
    Object.defineProperty(navigator, 'plugins', {
        // This just needs to have `length > 0` for the current test,
        // but we could mock the plugins too if necessary.
        get: () => [1, 2, 3, 4, 5],
    });
});

// Pass the Languages Test.
await page.evaluateOnNewDocument(() => {
    // Overwrite the `languages` property to use a custom getter.
    Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en'],
    });
});

// Mark the page as active and bring its tab to the front.
const session = await page.target().createCDPSession();
await session.send("Page.enable");
await session.send("Page.setWebLifecycleState", { state: "active" });
await page.bringToFront();

await page.goto(url, { waitUntil: "networkidle2" });

Any ideas how I'm still tipping them off that I'm running puppeteer unless I manually open a new tab and type into the address bar? Or, is there a way to force a more human-like interaction in the browser that opens the new tab and might allow me to do this headless?

edit: To be clear when I say "go into a sort of redirect and endless loading", what happens is that I see a brief flash of the page rendering, and then it goes to a blank white page. No change is noticed in the address bar but the loading icon indicator seems to show some type of redirection or refreshing. Whether I manually open the new tab before, during or after the puppeteer-created tab, as soon as the manual tab begins to load the URL, the puppeteer-created tab suddenly begins working.

3 Answers


The problem is with waitUntil and networkidle2. It happens because the rules Puppeteer follows are sometimes much stricter than what we would consider a "fully loaded webpage". As a human, you can tell whether your desired element is already in the DOM (because you can see it) or not (because you can't), but Puppeteer may be waiting for something else. For example, your element may already be visible while a background image is still loading.

Note: a poorly chosen waitUntil option can result in a blank page. I reproduced your issue by setting Puppeteer's default timeout to shorter periods.

(1) If you don't need every network connection for your task, you can speed up page loading by replacing waitUntil: 'networkidle2' with waitUntil: 'domcontentloaded', as this event usually fires earlier.

await page.goto(url, { waitUntil: 'domcontentloaded' })

The possible waitUntil options are:

  • load: when load event is fired.
  • domcontentloaded: when the DOMContentLoaded event is fired.
  • networkidle0: when there are no more than 0 network connections for at least 500 ms.
  • networkidle2: when there are no more than 2 network connections for at least 500 ms.

[source]

(2) If you are dealing with a single-page app where you cannot rely on the load or domcontentloaded events, and the networkidle ones would time out, you can wait for a selector rather than an event: a key selector that ensures the required JavaScript bundles are loaded and the page is functional:

page.goto(url) // not await-ed
await page.waitForSelector(keySelector) // ensures the page is functional
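
A minimal sketch of pattern (2), assuming `page` is a Puppeteer `Page`. The helper name `gotoAndWaitFor` and its `timeout` parameter are made up for illustration. Because the `page.goto` promise is not awaited, the sketch attaches a handler to it so that a late navigation timeout does not surface as an unhandled promise rejection:

```javascript
// Hypothetical helper, not part of Puppeteer's API.
async function gotoAndWaitFor(page, url, keySelector, timeout = 30000) {
  // Start the navigation without blocking on a lifecycle event. The second
  // callback turns a late navigation failure into a value instead of an
  // unhandled promise rejection.
  const navigation = page
    .goto(url, { waitUntil: 'domcontentloaded', timeout })
    .then(() => null, (err) => err);

  // Resolves once the selector is in the DOM, rejects after `timeout` ms.
  const element = await page.waitForSelector(keySelector, { timeout });

  // `navigation` resolves to null on success or to the navigation error.
  return { element, navigation };
}
```

Inside an async function this could be called as `const { element } = await gotoAndWaitFor(page, url, keySelector)`, where `keySelector` is whatever element proves the page is usable for your task.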

Disclaimer: the questions are different, but the answer is similar to the one for How to speed up puppeteer?

theDavidBarton
  • I appreciate the rundown of the `waitUntil` options in more detail, but unfortunately experimenting with different settings, or removing waitUntil entirely, does not solve my problem. I'm fairly certain this is some type of bot detection -- the site in question is aggressive in that regard. I'm navigating to a search engine first and clicking thru to the site, all other pages on other domains load fine, it is just this domain. It somehow reacts differently to me if and only if I manually open a tab and type in their url in headful mode. Still stumped I'm afraid. – David Claiborne Jul 19 '21 at 20:57

You could try changing the timeout value with the page.setDefaultTimeout(ms) method to see if it helps.
A value of 0 means no timeout; the default is 30 seconds (30000 ms).

Are you sure it is the page.goto() call that times out? Any call on the page object may time out if it exceeds the default page timeout, and since your code does not catch errors, any of them could be producing the timeout error.
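
One way to find out which call is timing out is to wrap each call and label its failure. This is only a sketch: `step` is a hypothetical helper, and it assumes Puppeteer's timeout errors carry the name `TimeoutError`.

```javascript
// Hypothetical wrapper: runs one async call and tags any failure with a label.
async function step(label, action) {
  try {
    return await action();
  } catch (err) {
    // Assumption: Puppeteer timeout errors have err.name === 'TimeoutError'.
    const kind = err.name === 'TimeoutError' ? 'timed out' : 'failed';
    throw new Error(`${label} ${kind}: ${err.message}`);
  }
}

// Usage inside an async function, with `page` and `url` as in the question:
//   page.setDefaultTimeout(60000); // raise the default page timeout to 60 s
//   await step('goto', () => page.goto(url, { waitUntil: 'networkidle2' }));
//   await step('wait for body', () => page.waitForSelector('body'));
```

The rethrown message then names the step that failed, which shows whether page.goto() is really the culprit.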

Are all the detection-avoidance steps required? What happens if you don't use them?
Are all the await calls inside an async scope? If not, that could be the source of your error.

There is also the page.setDefaultNavigationTimeout(ms) method to try.

If a page takes more than 30 s to load for whatever reason, and its loading time always exceeds the configured timeout, it may block program execution, depending on how the case is handled. If the program is written to repeat the call endlessly, the result is endless attempts to load the page, each one interrupted by the timeout.
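
If the program does retry, capping the number of attempts avoids that endless loop. A rough sketch follows; `withRetries`, `maxAttempts` and `delayMs` are names invented for this example.

```javascript
// Hypothetical helper: try `action` up to `maxAttempts` times, then give up
// and rethrow the last error instead of retrying forever.
async function withRetries(action, maxAttempts = 3, delayMs = 1000) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Pause before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage inside an async function, with `page` and `url` as in the question:
//   await withRetries(() => page.goto(url, { timeout: 60000 }));
```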

Darkosphere

I ran into that problem once, and my quick and dirty solution was to use Chrome instead of Chromium.

You do that like so:

const browser = await puppeteer.launch({
    executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
    headless: false,
});

You replace the string after executablePath with the path to Google Chrome on your machine. I don't know if it will work for you or not, but I had a similar problem a long time ago and this wound up working. In the end it depends on what tools the site you're scraping is using to detect you.
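
To avoid hard-coding one machine's path, something like the sketch below could look the executable up. Everything here is an assumption for illustration: the `findChrome` helper, the `CHROME_PATH` environment variable name, and the candidate paths, which are only common default install locations and may not match your system.

```javascript
const fs = require('fs');

// Common default install locations (assumptions; they vary by machine).
const CANDIDATES = {
  win32: [
    'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
    'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
  ],
  darwin: ['/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'],
  linux: ['/usr/bin/google-chrome', '/usr/bin/google-chrome-stable'],
};

// Hypothetical helper: prefer an explicit CHROME_PATH, else the first
// candidate that exists for this platform. Returns undefined if none is found.
function findChrome(env = process.env, platform = process.platform, exists = fs.existsSync) {
  if (env.CHROME_PATH && exists(env.CHROME_PATH)) return env.CHROME_PATH;
  return (CANDIDATES[platform] || []).find((p) => exists(p));
}

// Usage inside an async function:
//   const browser = await puppeteer.launch({ executablePath: findChrome(), headless: false });
```

If `findChrome()` returns undefined, the launch call should fall back to whatever browser Puppeteer would use by default.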

Z-Man Jones