According to https://github.com/GoogleChrome/puppeteer/issues/628, I should be able to get all links from <a href="xyz"> elements with this single line:

const hrefs = await page.$$eval('a', a => a.href);

But when I try a simple:

console.log(hrefs)

I only get:

http://example.de/index.html

... as output, which suggests that only 1 link was found. But the page definitely has 12 links in the source code / DOM. Why does it fail to find them all?

Minimal example:

'use strict';
const puppeteer = require('puppeteer');

crawlPage();

function crawlPage() {
    (async () => {
        const args = [
            "--disable-setuid-sandbox",
            "--no-sandbox",
            "--blink-settings=imagesEnabled=false",
        ];
        const options = {
            args,
            headless: true,
            ignoreHTTPSErrors: true,
        };

        const browser = await puppeteer.launch(options);
        const page = await browser.newPage();
        await page.goto("http://example.de", {
            waitUntil: 'networkidle2',
            timeout: 30000
        });

        const hrefs = await page.$eval('a', a => a.href);
        console.log(hrefs);

        await page.close();
        await browser.close();
    })().catch((error) => {
        console.error(error);
    });
}
Grant Miller
Vega

2 Answers

In your example code you're using page.$eval, not page.$$eval. Since the former uses document.querySelector instead of document.querySelectorAll, the behaviour you describe is the expected one.

Also, you should change your pageFunction in the $$eval arguments, since it receives an array of elements rather than a single element:

const hrefs = await page.$$eval('a', as => as.map(a => a.href));
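To make the difference concrete, here is a minimal sketch that runs both calls against the same page. The helper name compareEvalMethods is hypothetical, and it assumes `page` is a Puppeteer Page that has already navigated to a document containing at least one link.

```javascript
// Hypothetical helper: `page` is assumed to be an already-loaded Puppeteer Page.
async function compareEvalMethods(page) {
  // $eval hands the callback a single element (the first match of the selector).
  const first = await page.$eval('a', a => a.href);

  // $$eval hands the callback an array of every match, so map over it.
  const all = await page.$$eval('a', as => as.map(a => a.href));

  return { first, all };
}
```

On the page from the question, `first` should be the single URL that was logged, while `all` should hold one entry per link.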
Miguel Calderón
  • If I use page.$$eval I get "undefined" as output. – Vega Mar 26 '18 at 13:03
  • Thank you very much, that works. Does that mean that the code example on the GitHub page is wrong? – Vega Mar 26 '18 at 13:07
  • It's 'wrong' in the sense that it doesn't work that way right now, but if you reread the issue you'll see that the example code was a "proof of concept" of how `$$eval` would work once implemented (it is implemented now, and it works a bit differently). – Miguel Calderón Mar 26 '18 at 13:10
  • Some links might not have an `href` attribute, so `a[href]` seems like a more accurate selector. – ggorlen Sep 14 '22 at 17:34

The page.$$eval() method runs Array.from(document.querySelectorAll(selector)) within the page and passes it as the first argument to the page function.

Since a in your example represents an array, you will either need to specify which element of the array you want to obtain the href from, or you will need to map all of the href attributes to an array.

page.$$eval()

const hrefs = await page.$$eval('a', links => links.map(a => a.href));
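This also explains the "undefined" output mentioned in the comments. The following sketch reproduces it in plain Node.js without a browser; the `links` array of plain objects and the second URL are hypothetical stand-ins for the array of anchor elements that the page function receives.

```javascript
// Hypothetical stand-in for the array of <a> elements passed to the page function.
const links = [
  { href: 'http://example.de/index.html' },
  { href: 'http://example.de/about.html' },
];

// Page function from the question: reads .href off the array itself.
const fromArray = (a => a.href)(links);

// Page function from the answers: reads .href off each element in the array.
const fromElements = (as => as.map(a => a.href))(links);

console.log(fromArray);    // undefined, because arrays have no href property
console.log(fromElements); // an array containing both URLs
```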

Alternatively, you can use page.evaluate(), or a combination of page.$$(), elementHandle.getProperty(), and jsHandle.jsonValue(), to obtain an array of all links from the page.

page.evaluate()

const hrefs = await page.evaluate(() => {
  return Array.from(document.getElementsByTagName('a'), a => a.href);
});

page.$$() / elementHandle.getProperty() / jsHandle.jsonValue()

const hrefs = await Promise.all((await page.$$('a')).map(async a => {
  return await (await a.getProperty('href')).jsonValue();
}));
Grant Miller