I am trying to get the HTML of a URL using Puppeteer without following redirects or triggering the related HTTP requests (CSS, images, etc.).
According to the Puppeteer documentation, page.setRequestInterception()
can be used to ignore some requests.
I also found several SO questions, such as this one, that advise using request.isNavigationRequest()
and request.redirectChain()
to determine whether a request is the "main" one or a redirection.
So I tried that, but I get an Error: net::ERR_FAILED
error.
Example with http://google.com
(which, when requested, answers with a 301 and a Location: http://www.google.com/
header):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', (interceptedRequest) => {
    if (
      interceptedRequest.isNavigationRequest()
      && interceptedRequest.redirectChain().length !== 0
    ) {
      interceptedRequest.abort();
    } else {
      interceptedRequest.continue();
    }
  });
  await page.goto('http://google.com');
  const html = await page.content();
  console.log(html);
  await browser.close();
})();
Run with node --trace-warnings file.js
, it produces:
(node:14252) UnhandledPromiseRejectionWarning: Error: net::ERR_FAILED at http://google.com
at navigate (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async FrameManager.navigateFrame (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
at async Frame.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
at async Page.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
at async /path_to_working_dir/file.js:17:3
at emitUnhandledRejectionWarning (internal/process/promises.js:168:15)
at processPromiseRejections (internal/process/promises.js:247:11)
at processTicksAndRejections (internal/process/task_queues.js:94:32)
(node:14252) Error: net::ERR_FAILED at http://google.com
at navigate (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async FrameManager.navigateFrame (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
at async Frame.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
at async Page.goto (/path_to_working_dir/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
at async /path_to_working_dir/file.js:17:3
(node:14252) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
at emitDeprecationWarning (internal/process/promises.js:180:11)
at processPromiseRejections (internal/process/promises.js:249:13)
at processTicksAndRejections (internal/process/task_queues.js:94:32)
Using the http://www.example.com
URL (instead of http://google.com
) works fine: no error, and I get the same HTML that a curl http://www.example.com
would return.
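For reference, the difference between the two hosts can be checked outside Puppeteer with curl (assuming the servers still answer as described above: a 301 for google.com, a direct 200 for www.example.com):

    # google.com answers with a redirect status line...
    curl -sI http://google.com | head -n 1
    # ...while www.example.com serves the page directly
    curl -sI http://www.example.com | head -n 1
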
How can I discard unwanted redirection requests while still being able to perform actions on the "main" page (page.screenshot()
, page.content()
, ...)?