How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), using Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to do this.

However, looking through the many excellent examples online, I cannot find an obvious method for doing so. The closest I have been able to find is calling

const htmlContents = await page.content();

and saving the result, but that saves only the raw HTML, without any of the CSS, JavaScript, or media it references.

Is there a way to save webpages for offline use with Puppeteer?

  • Puppeteer won't implement this: https://github.com/GoogleChrome/puppeteer/issues/2433 – hardkoded Feb 21 '19 at 19:38
  • Well.. that is surprising to me, as I can't think of a good reason why they wouldn't implement that. At any rate, I hope someone has made a third-party extension in that case. – Coolio2654 Feb 21 '19 at 20:10
  • @hardkoded There is an experimental way, see answer below. – vsemozhebuty Feb 21 '19 at 23:48
  • Hi Coolio. Please do not (re)add conversational material to questions. Broadly, the readership here prefers a technical approach to writing, as succinctness is thought to add clarity. Gratitude is assumed by readers, and is best expressed in upvoting/acceptance. – halfer Jan 28 '20 at 22:44
  • I do not agree with that assertion, as writing clearly demands a bit of a relaxed touch, but you are the mod, so fair enough. – Coolio2654 Jan 29 '20 at 02:37

1 Answer

It is currently possible via the experimental CDP call `Page.captureSnapshot`, using the MHTML format:

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw DevTools Protocol session for this page and request
    // an MHTML snapshot, which bundles the HTML together with its
    // CSS, images and other subresources in a single file.
    const cdp = await page.target().createCDPSession();
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
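
For what it is worth, the same CDP session accepts any raw DevTools Protocol command, not just `Page.captureSnapshot`. As a minimal sketch (assuming the `cdp` session from above, used before the browser is closed), this reads the page's performance metrics via the Performance domain:

// Enable the Performance domain, then query it over the same session.
await cdp.send('Performance.enable');
const { metrics } = await cdp.send('Performance.getMetrics');
console.log(metrics); // an array of { name, value } entries
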
  • I will wait a bit to see if someone has managed to make a fork of Puppeteer that saves sites perfectly for offline use, but until then, thanks for your clear example. Is there any news on how much development `captureSnapshot` is getting? As you yourself implied, it is missing a lot of features, though it is slightly better than a raw HTML copy. – Coolio2654 Feb 22 '19 at 09:31
  • I am not aware of the details, sorry. Whether this format suffices depends on what you need to do with the result. – vsemozhebuty Feb 22 '19 at 09:44
  • I have seen a lot of code using CDP sessions; can you please tell me what they are used for and where they are useful? – Raj Saraogi Aug 11 '21 at 12:15
  • @RajSaraogi CDP is the protocol on which Puppeteer itself is built, so it provides more possibilities than Puppeteer's "sugar" API. See more here: https://chromedevtools.github.io/devtools-protocol/ – vsemozhebuty Aug 11 '21 at 12:54
  • @vsemozhebuty after downloading the offline page, how can I load it using puppeteer? I tried `goto('site.mhtml')` but I receive `Error: net::ERR_ABORTED` – itaied Jan 23 '22 at 12:36
  • @itaied If I understand correctly, you need the full absolute path to the file, like `file:///path/to/site.mhtml`. – vsemozhebuty Jan 23 '22 at 19:43
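
To illustrate the last two comments: a minimal sketch of reopening the snapshot, assuming it was saved as `page.mhtml` by the script above. Node's `path.resolve` builds the absolute `file://` URL described in the comment (POSIX-style paths assumed):

'use strict';

const path = require('path');
const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    // MHTML snapshots must be opened via an absolute file:// URL;
    // a relative path fails with net::ERR_ABORTED.
    await page.goto('file://' + path.resolve('page.mhtml'));
    console.log(await page.title());

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();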