How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), using Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to do this.

However, looking through the many excellent examples online, I cannot find an obvious method for doing so. The closest I have been able to find is calling

const htmlContents = await page.content();

and saving the result, but that saves only the raw HTML, without any of the CSS, JavaScript, or media it references.

Is there a way to save webpages for offline use with Puppeteer?

  • Puppeteer won't implement this: https://github.com/GoogleChrome/puppeteer/issues/2433 – hardkoded Feb 21 '19 at 19:38
  • Well.. that is surprising to me, as I can't think of a good reason why they wouldn't implement that. At any rate, I hope someone has made a third-party extension in that case. – Coolio2654 Feb 21 '19 at 20:10
  • @hardkoded There is an experimental way, see answer below. – vsemozhebuty Feb 21 '19 at 23:48
  • Hi Coolio. Please do not (re)add conversational material to questions. Broadly, the readership here prefers a technical approach to writing, as succinctness is thought to add clarity. Gratitude is assumed by readers, and is best expressed in upvoting/acceptance. – halfer Jan 28 '20 at 22:44
  • I do not agree with that assertion, as writing clearly demands a bit of a relaxed touch, but you are the mod, so fair enough. – Coolio2654 Jan 29 '20 at 02:37

1 Answer

It is currently possible via the experimental CDP call `Page.captureSnapshot`, using the MHTML format:

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw DevTools Protocol session for this page and request
    // an MHTML snapshot, which bundles the HTML together with its
    // CSS, images and other subresources in a single file.
    const cdp = await page.target().createCDPSession();
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
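
For what it is worth, the same CDP session accepts any raw DevTools Protocol command, not just `Page.captureSnapshot`. As a minimal sketch (assuming the `cdp` session from above, used before the browser is closed), this reads the page's performance metrics via the Performance domain:

// Enable the Performance domain, then query it over the same session.
await cdp.send('Performance.enable');
const { metrics } = await cdp.send('Performance.getMetrics');
console.log(metrics); // an array of { name, value } entries
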
  • I will wait a bit to see if someone has managed to make a fork of Puppeteer that saves sites perfectly for offline use, but until then, thanks for your clear example. Is there any news on how much development `captureSnapshot` is getting? As you yourself implied, it is missing a lot of features, though it is slightly better than a raw HTML copy. – Coolio2654 Feb 22 '19 at 09:31
  • I am not aware of the details, sorry. Whether this format suffices depends on what you need to do with the result. – vsemozhebuty Feb 22 '19 at 09:44
  • I have seen a lot of code using CDP sessions; can you please tell me what they are used for and where they are useful? – Raj Saraogi Aug 11 '21 at 12:15
  • @RajSaraogi CDP is the protocol on which Puppeteer itself is built, so it provides more possibilities than Puppeteer's "sugar" API. See more here: https://chromedevtools.github.io/devtools-protocol/ – vsemozhebuty Aug 11 '21 at 12:54
  • @vsemozhebuty after downloading the offline page, how can I load it using puppeteer? I tried `goto('site.mhtml')` but I receive `Error: net::ERR_ABORTED` – itaied Jan 23 '22 at 12:36
  • @itaied If I understand correctly, you need the full absolute path to the file, like `file:///path/to/site.mhtml`. – vsemozhebuty Jan 23 '22 at 19:43
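
To illustrate the last two comments: a minimal sketch of reopening the snapshot, assuming it was saved as `page.mhtml` by the script above. Node's `path.resolve` builds the absolute `file://` URL described in the comment (POSIX-style paths assumed):

'use strict';

const path = require('path');
const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    // MHTML snapshots must be opened via an absolute file:// URL;
    // a relative path fails with net::ERR_ABORTED.
    await page.goto('file://' + path.resolve('page.mhtml'));
    console.log(await page.title());

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();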