
I am currently working on a web scraper with Puppeteer that is supposed to get the details of individual articles from a listing page (think Amazon product listings).

So far, I have been getting element handles for all the article elements and then using $eval on each handle to extract the data I need in a structured way. Scraping 400 articles this way takes anywhere between 12s and 15s.

I have now tried just printing the innerHTML of all 400 elements using $$eval, which took only 5s. I was thinking of running an HTML parser over these strings instead.

My question is: is calling $eval for every piece of data I need (5 queries per article) better practice than just parsing the HTML?

This is how I get all article element handles and loop through each of them -> 12s - 15s runtime

    // getting element handles for all articles
    let articleHandles = await page.$$('.SearchResult_searchResultItemWrapper__VVVnZ');

    // looping through handles to extract info and push to array
    for (const handle of articleHandles) {
        const articleLink = await handle.$eval('a', e => e.getAttribute('href'));
        const articleImg = await handle.$eval('a>div>span>img', e => e.getAttribute('src'));
        const articleDesc = await handle.$eval('div>div', e => e.innerText);

        // $() takes only a selector, so the handle is evaluated separately;
        // this also avoids throwing when the dealer element is missing
        const dealerHandle = await handle.$('div.SearchResult_companyWrapper__W5gTQ');
        const dealer = dealerHandle ? await dealerHandle.evaluate(e => e.innerText) : "Not found";

        const detailsOne = cleanupArticleDescription(articleDesc);

        const detailsTwo = {
            link: articleLink,
            img: "https://www.****" + articleImg,
            dealer: dealer.substring(dealer.indexOf(' ') + 1)
        };

        const article = { ...detailsOne, ...detailsTwo };

        articlesAll.push(article);
    }

This is how I get the innerHTML of all elements and print it -> 5s runtime

    // $$eval returns the innerHTML strings directly (no element handles)
    let articleHtml = await page.$$eval(
        '.SearchResult_searchResultItemWrapper__VVVnZ',
        els => els.map(el => el.innerHTML)
    );

    for (const html of articleHtml) {
        console.log(html);
    }
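
For the parsing step, this is roughly what I had in mind (a sketch assuming cheerio; the selectors are just carried over from the $eval version above):

    const cheerio = require('cheerio');

    for (const html of articleHtml) {
        // parse the article's innerHTML in Node instead of querying the page
        const $ = cheerio.load(html);

        const articleLink = $('a').attr('href');
        const articleImg = $('a>div>span>img').attr('src');
        const articleDesc = $('div>div').first().text();

        const dealerEl = $('div.SearchResult_companyWrapper__W5gTQ');
        const dealer = dealerEl.length ? dealerEl.text() : "Not found";

        // ...same cleanup and push as in the $eval version
    }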

My goal is to scrape through roughly 1 million articles, so performance is the most important aspect.

Some side info:

  • The page uses infinite scroll; I scroll 10 times to load 400 elements (the scroll loop is sketched after this list)
  • I can switch pages using URL params as well, but I want to avoid overhead, since I am using Bright Data's Scraping Browser and paying for traffic. Currently I skip 10 pages once the scraping of 400 elements is done (going from page 1 to page 11). The page limits the number of items loaded via infinite scroll anyway; I found loading a new page with URL params every 10 pages to be the best of both worlds performance-wise.
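
The scroll loop itself looks roughly like this (a sketch; the wait condition and the assumption of ~40 new items per scroll are simplifications of my actual code):

    for (let i = 0; i < 10; i++) {
        // scroll to the bottom to trigger the next batch of results
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

        // wait until more items than before are in the DOM
        // (assumes ~40 items per batch, i.e. 400 after 10 scrolls)
        await page.waitForFunction(
            count => document.querySelectorAll('.SearchResult_searchResultItemWrapper__VVVnZ').length > count,
            {},
            (i + 1) * 40
        );
    }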
  • Try running the equivalent script in a browser's console (with querySelectorAll, console.time, etc.). Anything over that measure is overhead – Dimava Jul 17 '23 at 19:58
  • Since you have infinite scroll to load extra items, it may be better to add a fetch handler and directly parse the acquired API data – Dimava Jul 17 '23 at 19:59
  • @Dimava - you mean like shown here [in the Service Worker API docs](https://developer.mozilla.org/en-US/docs/Web/API/FetchEvent#examples:~:text=Instance%20methods-,Examples,-Specifications) ? – EnKaya Jul 18 '23 at 08:13
  • More like https://stackoverflow.com/questions/45822058/puppeteer-how-to-listen-to-a-specific-response – Dimava Jul 18 '23 at 16:09
  • This answer, to be exact: https://stackoverflow.com/a/56839061/5734961 – Dimava Jul 18 '23 at 16:10
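
For reference, the response-listening approach from the linked answer would look roughly like this (a sketch; the '/api/search' URL fragment is a placeholder, not the site's real endpoint):

    page.on('response', async (response) => {
        // intercept the XHR the page fires while infinite-scrolling
        // ('/api/search' stands in for the real endpoint)
        if (response.url().includes('/api/search') && response.ok()) {
            const data = await response.json();
            // data already arrives structured, so no DOM queries
            // or HTML parsing are needed at all
        }
    });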

0 Answers