I am currently working on a web scraper with Puppeteer that is supposed to get the details of individual articles (think Amazon).
So far, I have been getting ElementHandles for all the article elements and then using $eval on each handle to pull out the fields I need in a structured way. Scraping 400 articles this way takes anywhere between 12s and 15s.
I have now tried just printing the innerHTML of all 400 elements using $$eval, which only took 5s. I was thinking of running an HTML parser on these strings instead.
My question is: is calling $eval for each value I need (roughly 5 calls per article) better practice than just parsing the HTML?
This is how I get all article element handles and loop through each of them -> 12s - 15s runtime
// getting element handles for articles
let articleHandles = await page.$$('.SearchResult_searchResultItemWrapper__VVVnZ');

// looping through handles to get info and push to array
for (const handle of articleHandles) {
    const articleLink = await handle.$eval('a', e => e.getAttribute('href'));
    const articleImg = await handle.$eval('a>div>span>img', e => e.getAttribute('src'));
    const articleDesc = await handle.$eval('div>div', e => e.innerText);

    // $ only takes a selector; the text is read in a separate evaluate call
    const dealerHandle = await handle.$('div.SearchResult_companyWrapper__W5gTQ');
    const dealer = dealerHandle ? await dealerHandle.evaluate(e => e.innerText) : "Not found";

    let detailsOne = cleanupArticleDescription(articleDesc);
    let detailsTwo = {
        link: articleLink,
        img: "https://www.****" + articleImg,
        dealer: dealer.substring(dealer.indexOf(' ') + 1)
    };

    const article = { ...detailsOne, ...detailsTwo };
    articlesAll.push(article);
}
This is how I get all the innerHTML strings and print them -> 5s runtime
let articleHtml = await page.$$eval('.SearchResult_searchResultItemWrapper__VVVnZ', els => els.map(a => a.innerHTML));

for (const html of articleHtml) {
    console.log(html);
}
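For the HTML-parser route, this is roughly what I had in mind (a sketch only; cheerio here is just an example parser, not something I have benchmarked, and the selectors are the ones from the first snippet):

const cheerio = require('cheerio');

// articleHtml holds the innerHTML strings from the $$eval above
for (const html of articleHtml) {
    const $ = cheerio.load(html);
    const articleLink = $('a').attr('href');
    const articleImg = $('a>div>span>img').attr('src');
    const articleDesc = $('div>div').text();
    const dealer = $('div.SearchResult_companyWrapper__W5gTQ').text() || "Not found";

    // same shaping as in the first snippet
    articlesAll.push({
        ...cleanupArticleDescription(articleDesc),
        link: articleLink,
        img: "https://www.****" + articleImg,
        dealer: dealer.substring(dealer.indexOf(' ') + 1)
    });
}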
My goal is to scrape through roughly 1 million articles, so performance is the most important aspect.
Some side info:
- The page uses infinite scroll; I scroll 10 times to load 400 elements
- I can also switch pages via URL params, but I want to avoid overhead, since I am using Bright Data's scraping browser and paying for traffic. Currently I skip 10 pages once the scraping of 400 elements is done (going from page 1 to page 11). The page limits how many items the infinite scroll will load anyway; loading a new page via URL params every 10 pages turned out to be the best of both worlds performance-wise (see the sketch below).
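For reference, this is roughly what the scroll-then-jump loop looks like. It is a simplified sketch: BASE_URL, the page param, lastPage, and the 1s delay are placeholders for my real values.

// simplified sketch of the scroll/paging loop; BASE_URL, page param and lastPage are placeholders
for (let pageNo = 1; pageNo <= lastPage; pageNo += 10) {
    await page.goto(`${BASE_URL}?page=${pageNo}`, { waitUntil: 'domcontentloaded' });

    // scroll 10 times to trigger the infinite scroll and load ~400 items
    for (let i = 0; i < 10; i++) {
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await new Promise(r => setTimeout(r, 1000)); // give the next batch time to load
    }

    // ... scrape the ~400 loaded articles here (see the snippets above) ...
}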