
I am trying to get the HTML from a URL using Puppeteer without following redirections or triggering related HTTP requests (CSS, images, etc.).

According to the Puppeteer documentation, we can use page.setRequestInterception() to ignore some requests.

I also found several SO questions, such as this one, that advise using request.isNavigationRequest() and request.redirectChain() to determine whether a request is the "main" one or a redirection.

So I tried it, but I get an Error: net::ERR_FAILED.

Here is an example with http://google.com (which, when requested, answers with a 301 and a Location: http://www.google.com/ header):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', (interceptedRequest) => {
    if (
      interceptedRequest.isNavigationRequest()
      && interceptedRequest.redirectChain().length !== 0
    ) {
      interceptedRequest.abort();
    } else {
      interceptedRequest.continue();
    }
  });
  await page.goto('http://google.com');
  const html = await page.content();
  console.log(html);
  await browser.close();
})();

Running it with node --trace-warnings file.js, I get:

(node:14252) UnhandledPromiseRejectionWarning: Error: net::ERR_FAILED at http://google.com
    at navigate (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async FrameManager.navigateFrame (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
    at async /path_to_working_dir/file.js:17:3
    at emitUnhandledRejectionWarning (internal/process/promises.js:168:15)
    at processPromiseRejections (internal/process/promises.js:247:11)
    at processTicksAndRejections (internal/process/task_queues.js:94:32)
(node:14252) Error: net::ERR_FAILED at http://google.com
    at navigate (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async FrameManager.navigateFrame (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
    at async /path_to_working_dir/file.js:17:3
(node:14252) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
    at emitDeprecationWarning (internal/process/promises.js:180:11)
    at processPromiseRejections (internal/process/promises.js:249:13)
    at processTicksAndRejections (internal/process/task_queues.js:94:32)

Using the http://www.example.com URL (instead of http://google.com) works fine: there is no error, and I get the same output as curl http://www.example.com.

How can I discard unwanted redirection requests but still be able to perform actions on the "main" page (page.screenshot(), page.content(), ...)?

CDuv

1 Answer


I know this one is a bit old, but having run into the same issue, I thought it might still be worth sharing how to resolve it.

You get the net::ERR_FAILED error because, once the redirection happens (i.e. interceptedRequest.redirectChain().length !== 0), interceptedRequest represents the redirected request (to http://www.google.com/ in your case). That request is now the main navigation request, as shown by interceptedRequest.isNavigationRequest() returning true, yet you go on to call abort() on it. According to this test spec in the Puppeteer source, aborting the main navigation request during a page.goto() results in a net::ERR_FAILED error. Since the error is expected here, the solution is to catch it around page.goto() and handle it appropriately:

try {
    await page.goto('http://google.com');
} catch(e) {
    if (e instanceof Error && e.message.startsWith('net::ERR_FAILED')) {
        console.log('Redirection aborted');
    } else {
        throw e;
    }
}
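
For reference, the interception handler from the question and the try/catch above can be combined into one helper. This is only a minimal sketch: the names isRedirectedNavigation, isAbortedNavigationError and gotoWithoutRedirects are hypothetical (they are not part of Puppeteer), and it assumes page is a Puppeteer Page on which request interception has not been enabled yet.

```javascript
// Hypothetical helper: true when the request is a navigation request
// that was produced by a redirect (its redirect chain is not empty).
function isRedirectedNavigation(request) {
  return request.isNavigationRequest() && request.redirectChain().length !== 0;
}

// Hypothetical helper: true when the error looks like the one page.goto()
// rejects with after the main navigation request has been aborted.
function isAbortedNavigationError(error) {
  return error instanceof Error && error.message.startsWith('net::ERR_FAILED');
}

// Hypothetical wrapper around page.goto(). `page` is assumed to be a
// Puppeteer Page; nothing here requires Puppeteer at load time.
async function gotoWithoutRedirects(page, url) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (isRedirectedNavigation(request)) {
      request.abort();
    } else {
      request.continue();
    }
  });

  try {
    await page.goto(url);
    return { redirectAborted: false };
  } catch (error) {
    if (isAbortedNavigationError(error)) {
      // Expected: the redirected navigation request was aborted on purpose.
      return { redirectAborted: true };
    }
    throw error; // Anything else is a real failure.
  }
}
```

In the script from the question, this would be called as const result = await gotoWithoutRedirects(page, 'http://google.com'); right after browser.newPage(). Note that the check relies on the error message prefix, so any other failure that also surfaces as net::ERR_FAILED would be treated the same way.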
Josh Correia
Kris Dover