
I am looking for an efficient way to scrape information using puppeteer. Suppose a website has a list of items structured like this:

<div id="list">
  <div class="item" pos="0">
    <a href="www.somewebsite.com">
      <div class="nameToRetrieve"> Name 1 </div>
    </a>
  </div>
  <div class="item" pos="1">
    <a href="www.somewebsite.com">
      <div class="nameToRetrieve"> Name 2 </div>
    </a>
  </div>
  <div class="item" pos="2">
    <a href="www.somewebsite.com">
      <div class="nameToRetrieve"> Name 3 </div>
    </a>
  </div>
</div>

How can I retrieve the names (Name 1, Name 2 and Name 3)?

I have tried putting them into an object and then converting that into an array, but I am still confused about how to approach it.

const listOfStuff = document.getElementById('list').getElementsByClassName('itemResult')
Eddie
pam

2 Answers


There is a special convenience method page.$$eval for this task in puppeteer:

let result = await page.$$eval('.nameToRetrieve', names => names.map(name => name.textContent));
console.log(result);

This method runs Array.from(document.querySelectorAll(selector)) within the page and passes it as the first argument to pageFunction.

The result will be:

[ ' Name 1 ', ' Name 2 ', ' Name 3 ' ]
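
Note the surrounding whitespace in each string. As a minimal sketch (not part of the original answer), the page function could trim the text, and the same `page.$$eval` call could be pointed at `.item` to build one object per entry. The helper names `extractNames`, `extractItems`, `scrapeNames` and `scrapeItems` are made up for illustration, and `page` is assumed to be a Puppeteer page that has already loaded the list:

```javascript
// Page function: receives the array of matched elements, returns trimmed strings.
// It must not rely on outer variables, because Puppeteer runs it inside the page.
const extractNames = (elements) =>
  elements.map((el) => el.textContent.trim());

// Page function: builds one plain object per .item element.
const extractItems = (items) =>
  items.map((item) => {
    const link = item.querySelector('a');
    const name = item.querySelector('.nameToRetrieve');
    return {
      pos: item.getAttribute('pos'),
      href: link ? link.getAttribute('href') : null,
      name: name ? name.textContent.trim() : null,
    };
  });

// `page` is assumed to be an already-navigated Puppeteer Page (hypothetical wrappers).
async function scrapeNames(page) {
  // For the markup in the question: ['Name 1', 'Name 2', 'Name 3']
  return page.$$eval('#list .nameToRetrieve', extractNames);
}

async function scrapeItems(page) {
  return page.$$eval('#list .item', extractItems);
}
```

Returning plain strings and objects matters here, since whatever the page function returns has to be serialized back to Node; DOM elements themselves would not survive that trip.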

Vaviloff

I don't think this has much to do with the puppeteer API. In modern browsers (ES6) you can convert the collection to an array with Array.from and then just map over it. Note that I assumed nameToRetrieve only appears on the elements you want to retrieve, so there is no need to get the "list" first.

var names = Array.from(document.getElementsByClassName("nameToRetrieve")).map(x => x.innerHTML);
console.log(names)

<div id="list">
  <div class="item" pos="0">
    <a href="www.somewebsite.com">
      <div class="nameToRetrieve"> Name 1 </div>
    </a>
  </div>
  <div class="item" pos="1">
    <a href="www.somewebsite.com">
      <div class="nameToRetrieve"> Name 2 </div>
    </a>
  </div>
  <div class="item" pos="2">
    <a href="www.somewebsite.com">
      <div class="nameToRetrieve"> Name 3 </div>
    </a>
  </div>
</div>
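
The snippet above uses `document`, which only exists inside the browser page, so from a puppeteer script it would have to run in the page context. A minimal sketch using `page.evaluate` follows; the names `collectNames` and `scrapeWithEvaluate` are hypothetical, `page` is assumed to be an already-loaded Puppeteer page, and the `.trim()` call is an addition to strip the padding around each name:

```javascript
// Runs inside the page: `document` here is the browser's document, not a Node global.
function collectNames() {
  return Array.from(document.getElementsByClassName('nameToRetrieve'))
    .map((el) => el.innerHTML.trim());
}

// `page` is assumed to be a Puppeteer Page that has already navigated to the list.
async function scrapeWithEvaluate(page) {
  // page.evaluate serializes the function, runs it in the page,
  // and resolves with its (serializable) return value.
  return page.evaluate(collectNames);
}
```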
kabanus