
I'm using Puppeteer to scrape some pages, but I'm curious how to manage this in production for a Node app. I'll be scraping up to 500,000 pages per day, but these scrape jobs happen at random intervals, so it's not a single queue that I can plow through.

What I'm wondering is: is it better to open a browser, go to the page, then close the browser between each job? I would assume that's a lot slower, but it might handle memory better.

Or do I open one global browser when the app boots, then just go to the page, and have some way to dump that page when I'm done with it (e.g. closing all tabs in Chrome, but not closing Chrome itself), then open a new page when I need it? This way seems like it would be faster, but could potentially eat up a lot of memory.

I've never worked with this library, especially in a production environment, so I'm not sure if there are things I should watch out for.

jeremywoertink

3 Answers


You probably want to create a pool of multiple Chromium instances with independent browsers. The advantage of that is that when one browser crashes, all other jobs can keep running. The advantage of one browser (with multiple pages) is a slight memory and CPU saving, plus cookies are shared between your pages.

Pool of puppeteer instances

The library puppeteer-cluster (disclaimer: I'm the author) creates a pool of browsers or pages for you. It takes care of creation, error handling, browser restarting, etc., so you can simply queue jobs/URLs and the library takes care of everything else.

Code sample

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER, // use one browser per worker
        maxConcurrency: 4, // cluster with four workers
    });

    // Define a task to be executed for your data (put your "crawling code" in here)
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        // ...
    });

    // Queue URLs when the cluster is created
    cluster.queue('http://www.google.com/');
    cluster.queue('http://www.wikipedia.org/');

    // Or queue URLs anytime later
    setTimeout(() => {
        cluster.queue('http://...');
    }, 1000);
})();

You can also queue functions directly in case you have different tasks to do (see the sketch below). Normally you would close the cluster after you are finished via cluster.close(), but you are free to just leave it open. You can find another example in the repository of a cluster that fetches data when a request comes in.

Thomas Dondorf
  • Awesome! Thanks for the suggestion, I'll take a look at that. – jeremywoertink Sep 10 '18 at 20:26
  • The cluster doesn't listen to queue :( Would anyone like to work on the "queue" with the author? – CodeGuru Jun 22 '19 at 07:01
  • What if I have 10 clients calling my node app, each process opens 1 chrome instance, and each instance opens 3 tabs? Is there a way to manage the memory/CPU usage? – rodrigocprates Nov 21 '19 at 15:38
  • @rodrigocprates Yes, you can limit the instances. Check out the [execute](https://github.com/thomasdondorf/puppeteer-cluster#clusterexecutedata--taskfunction) function and [this example](https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js). – Thomas Dondorf Nov 21 '19 at 16:42

If you are scraping 500,000 pages per day (approximately one page every 0.1728 seconds), then I would recommend opening a new page in an existing browser session rather than opening a new browser session for each page.

You can open and close a Page using the following methods:

const page = await browser.newPage(); // opens a new tab in the existing browser
await page.close(); // closes the tab once the job is done

If you decide to use one Browser for your project, I would make sure to implement error handling procedures to ensure that if the program crashes, you have minimal downtime while you create a new Page, Browser, or BrowserContext.

Grant Miller
  • Also, each Chrome tab instance is separated now, so hopefully that should prevent memory from being eaten up too badly. – DrCord Aug 22 '18 at 18:40
  • I tried moving browser outside of the function that runs the scrape, but since the `puppeteer.launch` is async, it complains that the browser isn't ready essentially. I'll update with a code sample – jeremywoertink Aug 22 '18 at 23:20
  • Ok, I actually came across another library called puppeteer-pool that is going to work. Thanks for the help! – jeremywoertink Aug 23 '18 at 01:49
  • Did you even try running puppeteer at that rate? How many cores? I tried using 10 cores * 2, and I don't think I managed to hit 0.17 – CodeGuru Jun 22 '19 at 07:00
  • If you are scraping 500,000 pages and can invest time/skill - CEF is worth a look - https://en.wikipedia.org/wiki/Chromium_Embedded_Framework – Deepan Prabhu Babu Apr 18 '21 at 19:03
  • Reuse the browser and page instances instead of launching a browser each time.
  • Also, expose your Chrome scraper to take requests from a queue rather than a REST endpoint (see the sketch below). This ensures Chrome can take its sweet time, and if something crashes, the requests remain in the queue.

Other performance-related resources:

  1. Do not render images, fonts, and stylesheets if you only need the DOM (see the request-interception sketch after this list) - https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
  2. Improving Performance - https://docs.browserless.io/blog/2019/05/03/improving-puppeteer-performance.html
  3. If you have enough time - CEF is worth another look - https://bitbucket.org/chromiumembedded/cef/src/master/ - CEF allows you to embed Chromium into your own code instead of using libraries like Puppeteer. (Puppeteer works by launching Chrome on the side and communicating with it.)
  4. Also check out Microsoft's Playwright before investing time into puppeteer ( https://playwright.dev/ ).
  5. This is a tutorial on implementing web scraping using Kubernetes (k8s), OpenFaaS, and Puppeteer - https://www.openfaas.com/blog/puppeteer-scraping/
  6. This is an important article on how to use a proxy server to scrape using headless chrome and puppeteer - https://blog.apify.com/how-to-make-headless-chrome-and-puppeteer-use-a-proxy-server-with-authentication-249a21a79212/

Here is another example, using the puppeteer and generic-pool libraries.

const puppeteer = require('puppeteer');
const genericPool = require("generic-pool");
const debug = require('debug')('pool'); // logging helper used below (the 'pool' namespace is illustrative)

async function createChromePool() {
    
    const factory = {
        create: function() {
            //open an instance of the Chrome headless browser - Heroku buildpack requires these args
            return puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--ignore-certificate-errors'] });
        },
        destroy: function(client) {
            //close the browser; return the promise so the pool can await it
            return client.close();
        }
    };  
    const opts = { max: 1, acquireTimeoutMillis: 120000, priorityRange: 3};
    global.chromepool = genericPool.createPool(factory, opts);
    
    global.chromepool.on('factoryCreateError', function(err){
        debug(err);
    });
    global.chromepool.on('factoryDestroyError', function(err){
        debug(err);
    });

}

async function destroyChromePool() {
    
    // Only call this once in your application -- at the point you want to
    // shut down and stop using this pool.
    await global.chromepool.drain();
    global.chromepool.clear();

}
Deepan Prabhu Babu