
I have a scraping algorithm in Node.js with Puppeteer that scrapes 5 pages concurrently; when it finishes with one page it pulls the next URL from a queue and opens it in the same page. The CPU is always at 100%. How can I make Puppeteer use less CPU?
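
Roughly, the setup looks like this (a simplified sketch; the queue contents and the scraping step are placeholders):

const puppeteer = require('puppeteer');

// Illustrative queue; in reality the URLs come from my own queue implementation.
const queue = ['https://example.com/1', 'https://example.com/2' /* ... */];

async function worker(browser) {
  const page = await browser.newPage();
  let url;
  // Reuse the same page: pull the next URL and navigate in place.
  while ((url = queue.shift()) !== undefined) {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... scrape the page here ...
  }
  await page.close();
}

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  // Five workers, so five pages are scraped concurrently.
  await Promise.all(Array.from({ length: 5 }, () => worker(browser)));
  await browser.close();
})();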

This process is running on a DigitalOcean droplet with 4 GB of RAM and 2 vCPUs.

I've launched the Puppeteer instance with some args to try to make it lighter, but nothing changed:

puppeteer.launch({
  args: ['--no-sandbox', '--disable-accelerated-2d-canvas', '--disable-gpu'],
  headless: true,
});

Are there any other args I can give to make it less CPU hungry?

I've also blocked image loading:

await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType().toUpperCase() === 'IMAGE')
    request.abort();
  else
    request.continue();
});
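
A common extension of the same request-interception pattern (not part of the original question) is to skip other render-heavy resource types as well; a hedged sketch:

await page.setRequestInterception(true);
// Resource types that are rarely needed for scraping and cost CPU to fetch and render.
const skipped = new Set(['image', 'stylesheet', 'font', 'media']);
page.on('request', request => {
  if (skipped.has(request.resourceType()))
    request.abort();
  else
    request.continue();
});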
Pjotr Raskolnikov

3 Answers


These are my default args; please test them and tell me if they run smoothly. Please note that --no-sandbox isn't secure when navigating to vulnerable sites, but it's OK if you're testing your own sites or apps. So make sure you know what you're doing.

  const options = {
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--single-process',
      '--disable-gpu'
    ],
    headless: true
  }

  return await puppeteer.launch(options)
Edi Imanto
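
As a usage note (an addition, not part of this answer): launch a single browser with these options and share it across all of your pages, rather than launching a new browser per job. A sketch, assuming a getBrowser() wrapper around the snippet above:

const puppeteer = require('puppeteer');

// Hypothetical wrapper around the options shown in the answer above.
async function getBrowser() {
  return puppeteer.launch({
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--single-process',
      '--disable-gpu'
    ],
    headless: true
  });
}

(async () => {
  // Launch once and reuse the same browser for every URL,
  // instead of paying the Chrome startup cost per job.
  const browser = await getBrowser();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });
  await browser.close();
})();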

There are a few factors that can play into this. First, check whether the site(s) you're visiting use a lot of CPU themselves. Things like canvas rendering and heavy scripts can easily chew through your CPU.

If you're using Docker for your deployment, make sure you use dumb-init. There's a nice repo here that goes into why you'd use such a thing, but essentially the process that gets assigned PID 1 in your Docker image has some hiccups when it comes to handling termination:

EXPOSE 8080

ENTRYPOINT ["dumb-init", "--"]
CMD ["yarn", "start"]

This is something I've witnessed and fixed on browserless.io, since I use Docker to handle deployments, and runaway CPU usage was one of the issues.

browserless
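
Related to the termination point above (an addition, not part of the answer): even with dumb-init forwarding signals correctly, it helps to close the browser explicitly on shutdown so stray Chrome processes don't keep consuming CPU. A minimal sketch:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });

  // Close Chrome cleanly when the container is stopped, so no
  // leftover renderer processes are left burning CPU.
  const shutdown = async () => {
    await browser.close();
    process.exit(0);
  };
  process.on('SIGTERM', shutdown);
  process.on('SIGINT', shutdown);

  // ... scraping work goes here ...
})();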

To avoid parallel execution, which causes the high CPU usage, I had to execute jobs sequentially using the p-iteration NPM package. In my case that's not a problem, because my jobs don't take too much time.

You can use either the forEachSeries or the mapSeries function, depending on your scenario.
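
A hedged sketch of what that can look like (the URL list and the scraping step are placeholders):

const puppeteer = require('puppeteer');
const { forEachSeries } = require('p-iteration');

// Placeholder job list; substitute your own queue or data source.
const urls = ['https://example.com/a', 'https://example.com/b'];

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // forEachSeries awaits each callback before starting the next one,
  // so only one page is being scraped at any time.
  await forEachSeries(urls, async (url) => {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... scrape the page here ...
  });

  await browser.close();
})();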