
I'm not too familiar with advanced JavaScript and am looking for some guidance. I want to store webpage content in a database using puppeteer-cluster. Here's a starting example:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.content();
    // Store content, do something else
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

It looks like I may have to use the pg package to connect to the database. What would be the recommended approach?

Here's my table:

+----+-----------------------------------------------------+---------+
| id | url                                                 | content |
+----+-----------------------------------------------------+---------+
| 1  | https://www.npmjs.com/package/pg                    |         |
+----+-----------------------------------------------------+---------+
| 2  | https://github.com/thomasdondorf/puppeteer-cluster/ |         |
+----+-----------------------------------------------------+---------+

I believe I'd have to pull the data (id and url) into an array and, each time content is received, store it in the DB (by id and content).

Thomas Dondorf
sojim2
  • I would be executing this script with `node` on the server to fetch website content, for example `node example.js`, where `example.js` is the script above. https://github.com/GoogleChrome/puppeteer/blob/master/README.md – sojim2 Mar 20 '19 at 18:20

1 Answer


You should create a database connection outside of the task function:

const { Client } = require('pg');
const client = new Client(/* ... */);
await client.connect();

Then you query the data and queue it (with the ID to be able to save it in the database later on):

// pg resolves to a result object; the selected rows are in its `rows` property
const { rows } = await client.query('SELECT id, url FROM your_table WHERE ...');
rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));
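
As a minimal sketch of this queuing step, the helper below takes the database client and the cluster as arguments, so it can be tried with stand-in objects before wiring in a real connection. The name `queueRows` is hypothetical. The sketch assumes that pg's `query()` resolves to a result object whose `rows` array holds the selected rows, and it leaves out the elided WHERE clause.

```javascript
// Hypothetical helper: reads (id, url) pairs and queues one job per row.
// `db` is assumed to expose pg's query(text) method,
// `cluster` is assumed to expose puppeteer-cluster's queue(data) method.
async function queueRows(db, cluster) {
  // pg resolves to a result object; the selected rows live in its `rows` array.
  const { rows } = await db.query('SELECT id, url FROM your_table');
  for (const row of rows) {
    // Pass the id along so the task can update the matching row later.
    cluster.queue({ id: row.id, url: row.url });
  }
  return rows.length;
}
```

With a connected client this would be called as `await queueRows(client, cluster)` once the task function has been registered.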

And then, at the end of your task function, you update the table row.

await cluster.task(async ({ page, data: { id, url } }) => {
    // ... run puppeteer and save results in content variable
    await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
});
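
The update step can also be isolated into a small factory, which makes it possible to exercise the logic without a browser or a database. This is only a sketch: the name `makeSaveContentTask` is hypothetical, and `db` is assumed to be anything with pg's `query(text, values)` method.

```javascript
// Hypothetical factory: returns a task function that stores the page HTML by row id.
// `db` is assumed to expose pg's query(text, values) method.
function makeSaveContentTask(db) {
  return async ({ page, data: { id, url } }) => {
    await page.goto(url);
    const content = await page.content();
    // Parameterized query: $1 and $2 are filled from the values array.
    await db.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
  };
}
```

It would be registered with `await cluster.task(makeSaveContentTask(client));`.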

In total, your code should look like this (note that I have not tested this code myself):

const { Cluster } = require('puppeteer-cluster');
const { Client } = require('pg');

(async () => {
    const client = new Client(/* ... */);
    await client.connect();

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
    });

    await cluster.task(async ({ page, data: { id, url } }) => {
        await page.goto(url);
        const content = await page.content();
        await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
    });

    // pg returns a result object; the selected rows are in its `rows` property
    const { rows } = await client.query('SELECT id, url FROM your_table');
    rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));

    await cluster.idle();
    await cluster.close();
    // Close the database connection so the Node process can exit
    await client.end();
})();
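
One case the listing does not handle is a page that fails to load or an UPDATE that throws. The following is a minimal sketch of one way to keep track of such jobs, assuming puppeteer-cluster's `taskerror` event, which passes the error and the queued data to the listener. The helper name `collectTaskErrors` is hypothetical.

```javascript
// Hypothetical helper: records failed jobs so they can be inspected or re-queued.
// `cluster` is assumed to emit 'taskerror' with (err, data), as puppeteer-cluster does.
function collectTaskErrors(cluster) {
  const failures = [];
  cluster.on('taskerror', (err, data) => {
    // `data` is whatever was queued, here an object with id and url.
    failures.push({ id: data.id, url: data.url, message: err.message });
  });
  return failures;
}
```

Calling `const failures = collectTaskErrors(cluster);` right after `Cluster.launch()` and checking `failures` after `cluster.idle()` would show which rows were not updated.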
Thomas Dondorf