
Using Node.js and headless Chrome with puppeteer on an Ubuntu server, I'm scraping a few different websites. One occasional task is to interact with a loaded page (click a link to open another page, then possibly click again to accept terms and such).

I can do all of this just fine, but I'm trying to understand how it will work if I have multiple pages open simultaneously and am interacting with different loaded pages at overlapping times.

To visualize this, I think about how a user would do the same job: they would have to open multiple browser windows, load a page in each, and switch between them to look at the pages and click the links.

But with puppeteer we get a browser object, so we don't need to see the window or the page to know where to click. We can traverse the page through the browser object and click the desired element without ever looking at it (headless).

I'm thinking I should be able to work multiple pages at the same time, as long as I have enough CPU and memory available to handle them.
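To make this concrete, here is a minimal sketch of the kind of thing I have in mind: one shared browser, one async workflow per page, all running with overlapping lifetimes. The URLs and the `a.some-link` selector are placeholders, not a real site:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default

  // one independent workflow per page
  const visit = async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    // click a link and wait for the resulting navigation, without ever seeing the page
    await Promise.all([
      page.waitForNavigation(),
      page.click('a.some-link'), // placeholder selector
    ]);
    await page.close();
  };

  // all three pages are in flight at the same time
  await Promise.all([
    visit('https://example.com/one'),
    visit('https://example.com/two'),
    visit('https://example.com/three'),
  ]);

  await browser.close();
})();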

Does anyone have any experience with puppeteer interacting with multiple websites simultaneously? Anything I need to watch out for?

  • I tried to scrape pages simultaneously with puppeteer, but I was running out of memory too soon. I opened a lot of URLs with a `forEach`; it killed my PC instantly. Then I tried batches of 10 URLs at a time, which wasn't good either: a lot of attempts resulted in timeouts. So now I am going through 3000+ URLs sequentially, which still has some issues, e.g.: https://stackoverflow.com/questions/62220867/puppeteer-chromium-instances-remain-active-in-the-background-after-browser-disc. If you have massive hardware behind your Ubuntu server, you may have success. – theDavidBarton Jun 08 '20 at 11:44
  • I am considering using [puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster) next time. – theDavidBarton Jun 08 '20 at 11:44
  • Thanks @theDavidBarton, I'll look into puppeteer-cluster. I see the author of puppeteer-cluster has posted an answer. I'll respond there as well. – Curious101 Jun 08 '20 at 17:11

2 Answers


This is the problem the library puppeteer-cluster (I'm the author) addresses. It allows you to build a pool of pages (or browsers) and run tasks in them.

You'll find several general code samples in the repository (and also on Stack Overflow). Let me address your specific use case of running different tasks with an example.

Code Sample

The following code creates two tasks:

  • crawl: Opens the page, extracts a URL, and then queues the second task
  • screenshot: Takes a screenshot of the extracted URL

The process is started by queuing the crawl task with the URLs.

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({ // use four pages in parallel
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 4,
    });

    // We define two tasks
    const crawl = async ({ page, data: url }) => {
        await page.goto(url);
        // example only: grab the first link on the page (adapt the selector to your case)
        const extractedURL = await page.$eval('a', (el) => el.href);
        cluster.queue(extractedURL, screenshot);
    };

    const screenshot = async ({ page, data: url }) => {
        await page.goto(url);
        await page.screenshot(); // returns a Buffer; pass { path: '...' } to save to disk
    };

    // Crawl some pages
    cluster.queue('https://www.google.com/', crawl);
    cluster.queue('https://github.com/', crawl);

    // Wait until everything is done and close the cluster
    await cluster.idle();
    await cluster.close();
})();

This is a minimal example. I left out error handling, monitoring and the setup options.
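For reference, a minimal sketch of what those left-out parts could look like, using the `monitor` launch option and the `taskerror` event from the library (a sketch, not a complete setup):

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 4,
        monitor: true, // print a live overview of the queue to the terminal
    });

    // fires whenever a queued task throws; data is whatever was passed to queue()
    cluster.on('taskerror', (err, data) => {
        console.log(`Error crawling ${data}: ${err.message}`);
    });

    // ... define and queue tasks as in the example above ...

    await cluster.idle();
    await cluster.close();
})();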

  • Thanks @Thomas-dondorf. That looks like a very useful library. Appreciate your time and effort in creating it and posting about it. I'll add this and test it out and report back. – Curious101 Jun 08 '20 at 18:00

I can usually get 5 or so browsers going on a 4GB server. If you're just popping URLs off a queue, it's pretty straightforward:

const puppeteer = require('puppeteer');

let queue = [
  'http://www.amazon.com',
  'http://www.google.com',
  'http://www.facebook.com',
  'http://www.reddit.com',
]

const doQueue = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  let url
  // pull URLs off the shared queue until it's empty
  while ((url = queue.shift())) {
    await page.goto(url)
    console.log(await page.title())
  }
  await browser.close()
};

// three workers, each with its own browser, drain the same queue concurrently
[1, 2, 3].map(() => doQueue())
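If one bad URL shouldn't take down a whole worker, here is a sketch of the same loop with the navigation errors caught (`doQueueSafe` is just an illustrative name; it reuses the `puppeteer` import and `queue` from above):

const doQueueSafe = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  let url
  while ((url = queue.shift())) {
    try {
      await page.goto(url)
      console.log(await page.title())
    } catch (err) {
      // a timeout or DNS error skips this URL instead of killing the worker
      console.error(`failed on ${url}: ${err.message}`)
    }
  }
  await browser.close()
}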