I have a Puppeteer script that iterates through a list of URLs saved in urls.txt and scrapes each one. I have two issues:

  1. If one of the URLs in the list times out, it stops the whole process. I would like it to skip any URLs that don't work or time out and just move on to the next URL. I have tried to put in a catch(err), but I'm not placing it correctly and it fails.

  2. If the list of URLs is longer than about 5, it freezes my server and I have to reboot. I think maybe it's waiting to iterate through all the URLs before saving and that's overloading the server? Or is there something else in my code that is causing the problem?

const puppeteer = require('puppeteer');
const fs = require('fs');
const axios = require('axios');

process.setMaxListeners(Infinity); // <== Important line

async function scrapePage(url, index) {
  // Launch a new browser
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox'] });

  // Open a new page
  const page = await browser.newPage();

  // Set the user agent
  await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36');

  // Navigate to the desired webpage
  await page.goto(url, {
    waitUntil: "domcontentloaded",
  });

  // Wait for selector
  await (async () => {
    await page.waitForSelector("#root > section > section > main > div.py-6.container > div.columns.mt-4 > div.column.is-flex-grow-2 > div:nth-child(3) > div.ant-card-body > div > div > div > canvas", { visible: true });
  })();

  // Get the HTML content of the page
  const html = await page.content();

  // Generate the file name using the index value
  const htmlFileName = `${index.toString().padStart(4, '0')}.html`;
  const screenshotFileName = `${index.toString().padStart(4, '0')}.png`;

  // Check if the HTML file exists
  const filePath = '/root/Dropbox/scrapes/' + htmlFileName;
  if (fs.existsSync(filePath)) {
    // If the file exists, rewrite the content with the new scraped HTML
    fs.writeFileSync(filePath, html);
  } else {
    // If the file doesn't exist, create the file
    fs.closeSync(fs.openSync(filePath, 'w'));

    // Save the scraped content to the newly created file
    fs.writeFileSync(filePath, html);
  }

  // Capture a screenshot of the page
  await page.screenshot({ path: '/root/scrapes/' + screenshotFileName });


  // Close the browser
  await browser.close();
} 

// Read the lines of the file
const lines = fs.readFileSync('/root/Dropbox/urls.txt', 'utf-8').split('\n');

// Iterate through each URL in the file
for (let i = 0; i < lines.length; i++) {
  // Scrape the page
  scrapePage(lines[i], i + 1);
}

1 Answer

The URLs being opened and closed too quickly, without being waited on, is probably what freezes the server while iterating through the list; changing scrapePage(lines[i], i + 1); to await scrapePage(lines[i], i + 1); should solve that (see the sketch below). Also, page.waitForSelector doesn't need to be wrapped in (async () => { ... }) the way you have it in your code.
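
For that first change in isolation, here is a minimal sketch; it assumes the scrapePage function and the urls.txt path from your question, and uses a placeholder selector variable in place of the long canvas selector:

// Inside scrapePage, await the selector directly -- the wrapper IIFE is unnecessary:
await page.waitForSelector(selector, { visible: true }); // selector = the long canvas selector from the question

// The loop has to live inside an async function so each call can be awaited;
// that makes the scrapes run one after another instead of all launching at once:
(async () => {
  const lines = fs.readFileSync('/root/Dropbox/urls.txt', 'utf-8').split('\n');
  for (let i = 0; i < lines.length; i++) {
    await scrapePage(lines[i], i + 1); // wait for this page to finish before starting the next
  }
})();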

To check whether a URL works or not, you need to look at the response returned by page.goto(); if its HTTP status code is 200, the page loaded OK. For the try...catch, take a look at the code below.
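
Condensed, the skip-on-error pattern looks like this (just a sketch; the full version with the file writing is in the code further below):

try {
    const res = await page.goto(url, { waitUntil: "domcontentloaded", timeout: 20000 });
    if (res.status() == 200) {
        // page reached: wait for the selector, grab the content, save the files ...
    } else {
        console.log(res.status()); // reachable but not OK -- just log it and skip
    }
} catch (e) {
    // navigation error or waitForSelector timeout -- log it and move on to the next URL
    console.log(e.message);
}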

urls.txt - random URLs to test the code - stackover5flow is there deliberately to produce an error:

https://stackoverflow.com/questions/75911774/error-during-ssl-handshake-with-remote-server-node-js-with-apache
https://stackoverflow.com/questions/75911761/unit-tests-of-private-function-in-javascript
https://stackoverflow.com/questions/75911767/is-r-language-more-simplified-to-use-than-sql
https://stackoverflow.com/questions/75911766/is-there-an-equivalent-of-term-variable-on-windows
https://stackover5flow.com/questions/75911176/puppeteer-timeout-30000ms-when-headless-is-true
https://stackoverflow.com/questions/75909981/how-can-i-elevate-the-privileges-of-an-executable-using-setuid-on-mac
https://stackoverflow.com/questions/75911746/function-that-returns-an-array-of-4-int-taking-the-values-0-or-1-and-that-rando

Code:

const puppeteer = require('puppeteer');
const fs = require('fs');
const fsp = fs.promises;

process.setMaxListeners(Infinity); // <== Important line

let browser;
(async () => {

async function scrapePage(url, index, timeout = false) {
    // Launch a new browser for this URL; assign it to the outer `browser` variable
    // so the .finally() cleanup at the bottom can close it if an error escapes this function
    browser = await puppeteer.launch({headless: false, args: ['--no-sandbox']});
    const page = await browser.newPage();
    // Set the user agent
    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36');

    try {
        let res = await page.goto(url, {waitUntil: "domcontentloaded", timeout: 20000});
        res = res.status();
        
        if (res == 200) { // test if url can connect or not
            let selector = (timeout) ? "body[class=potato]" : "div.container"; // just to test timeout

            await page.waitForSelector(selector, {visible: true, timeout : 5000}); // timeout set to 5 secs, change if required
            const html = await page.content();
            const htmlFileName = `${index.toString().padStart(4, '0')}.html`;
            const screenshotFileName = `${index.toString().padStart(4, '0')}.png`;

            const dir = 'test/scrapes';
            const filePath = `${dir}/${htmlFileName}`;
            const screenshotPath = `${dir}/${screenshotFileName}`;

            await fsp.mkdir(dir, { recursive: true }); // create the directory if it doesn't exist (fs.promises takes no callback)
            await fsp.writeFile(filePath, html, { flag: 'w+' }); // write/overwrite; a failure here rejects and is caught below
            
            await page.screenshot({ path: screenshotPath});

        } else {
            console.log (res); 
        }

    } catch (e) { // this will catch any timeout or connection error
        console.log(e.message); 
    }
    
    await browser.close();
    browser = null; // mark as closed so the final cleanup doesn't try to close it again
}

// Read the list of URLs
const lines = (await fsp.readFile('urls.txt', 'utf8')).split('\n');

for (let i = 0; i < lines.length; i++) {  
    let timeout = (i == 1); // force a timeout on the second URL just to demonstrate the catch
    await scrapePage(lines[i], i + 1, timeout);
}

})().catch(err => console.error(err)).finally(() => browser?.close());

Result: the script will log errors for the URLs that fail (the bad hostname, the selector timeout) but will keep running through the rest of the list; errors while writing/overwriting the files are caught and logged the same way.
