I am using Puppeteer to scrape a website that loads its content dynamically. I've set it up in headless mode and even added a wait time to ensure that the content gets loaded, but I'm still not able to retrieve the dynamically loaded content.

Here's a snippet of my code:

const puppeteer = require('puppeteer');

async function getLinksFromBase(category) {
  const urlBase = 'https://xepelin.com/blog/';
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();

  await page.goto(urlBase + category);

  // Wait 5 seconds to make sure everything has loaded
  new Promise(r => setTimeout(r, 5000));

  const hrefs = await page.$$eval('a', anchors => {
    return anchors.map(anchor => anchor.href).filter(href => !!href);
  });

  await browser.close();

  return hrefs;
}

When I execute the function, the resulting hrefs don't include the links from the dynamically loaded content.

  • I've tried waiting for specific selectors using waitForSelector.

  • I've checked if there are any redirections or pop-ups, but I don't think there are any.

  • I've also tried to emulate network conditions and checked console logs for any errors, but couldn't find any clues.

The page I'm trying to scrape is https://xepelin.com/blog/empresarios-exitosos.
I also need to click the "load more" (Cargar más) button repeatedly in order to get all the blog posts, so that I end up with all the data I want.

Has anyone faced a similar issue or can point out what I might be missing? Any guidance would be much appreciated! It's my first time using Puppeteer to scrape data; I previously used Python.

Thank you.

  • Did you try running headfully? What's the point of `.filter(href => !!href);`? This just gives you a big array of booleans, pretty much useless. What results are you actually trying to get--how many links are you expecting? – ggorlen Aug 08 '23 at 14:38
  • You need to `await` all your promises... – pguardiario Aug 08 '23 at 23:55
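
The second comment refers to the `new Promise(...)` line in the question: it creates a timer promise but never awaits it, so the function carries on immediately. A minimal sketch of the difference follows, using a hypothetical `sleep` helper (an illustration only, not part of Puppeteer):

```javascript
// Hypothetical helper: returns a promise that resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withoutAwait() {
  const start = Date.now();
  sleep(200); // promise is created but not awaited, so nothing pauses here
  return Date.now() - start; // elapsed time is close to 0 ms
}

async function withAwait() {
  const start = Date.now();
  await sleep(200); // execution is suspended until the timer fires
  return Date.now() - start; // elapsed time is roughly 200 ms
}
```

Even when awaited, a fixed delay is only a guess at when the content will be ready; waiting for a specific selector or response is usually more dependable.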

1 Answer

You should probably wait for network idle on load. And don't use `$$eval` :) It's an anti-pattern.

This code works for me both for blog and empresarios-exitosos:

await page.goto("https://xepelin.com/blog/empresarios-exitosos", {waitUntil: "networkidle2"});

const hrefElements = await page.$$('a[href]')
const hrefs = await Promise.all(hrefElements.map(a => a.evaluate(el => el.attributes['href'].nodeValue)))
console.log(hrefs);

I added `waitUntil`, replaced your `$$eval` with `$$`, and changed the selector from `a` to `a[href]`, so you get only the elements that have an href. Then I read each element's href `nodeValue` with an async map.

Yaroslavm
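
Neither the question's code nor the answer above handles the "Cargar más" button. The following is a rough sketch of one way to click such a button until it disappears, waiting after each click for the number of posts to grow instead of sleeping. `loadAllPosts`, `buttonSelector`, `itemSelector` and `maxClicks` are hypothetical names, and the real selectors for xepelin.com are not known here; they would have to be taken from the page's markup. The function expects an already opened Puppeteer `page`.

```javascript
// Sketch only: clicks a "load more" button until it is gone.
// All names and selectors are assumptions, not taken from the site.
async function loadAllPosts(page, buttonSelector, itemSelector, maxClicks = 50) {
  let clicks = 0;
  while (clicks < maxClicks) {
    // page.$ resolves to null when nothing matches the selector.
    const button = await page.$(buttonSelector);
    if (!button) break;

    // Count the items currently rendered, in a single round trip.
    const before = await page.$$eval(itemSelector, (els) => els.length);

    await button.click();
    clicks++;

    try {
      // Wait until more items exist than before the click.
      await page.waitForFunction(
        (sel, n) => document.querySelectorAll(sel).length > n,
        { timeout: 10000 },
        itemSelector,
        before
      );
    } catch (err) {
      // The wait failed (most likely a timeout): assume the list is complete.
      break;
    }
  }
  return clicks; // number of clicks actually performed
}
```

With a real page this might be called as `await loadAllPosts(page, 'button.load-more', 'article')` before collecting the links; both selectors in that call are placeholders.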
  • How is `$$eval` an antipattern? Much better to do it in one simple step than the rather awkward, slow and unreliable method you've shown here, that involves `n` separate `evaluate` calls on a potentially stale list of ElementHandles. `networkidle` shouldn't be necessary, either. That waits for _everything_ to load, even when you only need one thing--wait for [the one thing specifically with `waitForSelector`/`waitForResponse`/`waitForFunction`](https://serpapi.com/blog/puppeteer-antipatterns/#never-using-domcontentloaded). – ggorlen Aug 08 '23 at 14:31
  • 1. It's easier to debug. 2. It's more stable in cases with dynamic content. 3. `$$eval` doesn't fit cases where you need to wait for a particular number of elements to be rendered. When automating async systems you don't care about speed; you need stability and a ready state. Using `$$eval` caused additional flakiness on at least 2 projects I worked on. And thank you for downvoting and voting for deletion because of your own opinion. – Yaroslavm Aug 08 '23 at 14:55
  • How is `$$` more stable? `$$eval` does both selection and property retrieval in one step. With `$$`, you decouple the selection from the retrieval step, so time passes between the selection and when you `evaluate` it, which is where race conditions can occur. `$$` doesn't wait for anything to render. It's not a matter of opinion, it's a matter of misinformation and not answering the question (it's unclear what OP wants, but almost certainly this isn't it). Most of your answers are low-quality spam to self-promote your questionable-looking wrapper lib, 2 upvotes out of 17 answers recently. – ggorlen Aug 08 '23 at 14:59
  • In fact, ElementHandles (the type returned by `$$`) are [deprecated in Playwright](https://playwright.dev/docs/api/class-locator#locator-element-handle): "ElementHandles [...] are inherently racy", and the same is true in Puppeteer. In Puppeteer, `$$eval` is the easiest way to avoid a race condition, fastest (any `evaluate` call is slow, so you might as well do as few as possible), and easiest to code. Flakiness in your recent projects was likely for other reasons than `$$eval`. – ggorlen Aug 08 '23 at 15:04
  • That's what I was talking about. Speed is not something you should care about when you automate high-load async systems. You care about the SR of each test. If you have a function that should wait for a particular number of elements to be rendered, you can't use `$$eval`. When you need to debug your functions, my approach is easier to debug. If you need Puppeteer to slow down rather than speed up, `$$eval` is not the right choice. Also, code practices within one project should be consistent. OP asked for links from dynamically loaded content; my code gets all rendered links (and reliably returns the same result). – Yaroslavm Aug 08 '23 at 15:11
  • `$$` is objectively worse for speed (multiple sub-process `evaluate` calls rather than one), reliability (separating selection and property access trips to the browser causes a race condition) and code quality (more code to write, harder to read). `$$` doesn't wait for a particular amount of elements to be rendered--it doesn't wait for anything any more than `$$eval` does--both run instantly, always. If you want to debug each `evaluate` or add a delay between each operation, sure, use this, but that's an uncommon use case--I've never had to resort to this sort of code for that. – ggorlen Aug 08 '23 at 15:21
  • Check out the docs for these methods: [`page.$$`](https://pptr.dev/api/puppeteer.page.__/): "The method runs `document.querySelectorAll` within the page. If no elements match the selector, the return value resolves to `[]`." [`page.$$eval`](https://pptr.dev/api/puppeteer.page.__eval): "This method runs `Array.from(document.querySelectorAll(selector))` within the page and passes the result as the first argument to the `pageFunction`.". Basically, both are just wrappers on the same thing: `querySelectorAll`, and neither wait. But with `$$eval` you avoid the extra slow `evaluates` in the loop. – ggorlen Aug 08 '23 at 15:31
  • Again, I'm addressing the issue that you often need to slow down. I know how eval works and how `$$` works. I don't need a result that I get very fast; I need a correct result. Different solutions in one project lead to a situation where each team member uses a different solution. Someone calls waitForElementAmountToBeMoreThan and then performs evaluate, someone else uses `$$eval`. Sometimes you don't need to wait, so the test passes. Then something changes and rendering slows down. The first implementation wouldn't fail; the second would. – Yaroslavm Aug 08 '23 at 15:38
  • I'm not talking about this case, I'm talking in general, when your project has complicated frontend logic, like in Figma, WiX, Canva, etc., and you have tons of tests and CI, and each false-negative test result is critical. I'm just describing the difference I saw in practice between using `$$eval` and using waitForElementsCollectionLengthToBe over some time interval. – Yaroslavm Aug 08 '23 at 15:42
  • You shouldn't be relying on speed one way or the other, you should be relying on predicates. The goal is to take timing out of the picture as much as possible. When the predicate becomes true based on a `waitForX` (other than `waitForTimeout`, which is poor practice), you want to immediately take action. If some property doesn't exist, wait for it in an event-driven manner. Introducing timeouts doesn't increase reliability, it decreases it. If waiting based on timing was good, Puppeteer wouldn't have deprecated `waitForTimeout`. Separating `$$` and `evaluate` is basically a small timeout. – ggorlen Aug 08 '23 at 15:44
  • So, well, in an ideal world your arguments are correct. In practice, that doesn't always work. This is a very long discussion and I'm not sure we can reach a consensus. I get your points and understand why you hold them; it works for you, and that's OK. In my experience, your approach cost a large amount of money at companies I worked for, and my approach in similar cases reduced that cost in practice. So I hope both our approaches keep working for us. – Yaroslavm Aug 08 '23 at 15:54
  • Don't trust me, trust the [Playwright](https://playwright.dev/docs/api/class-page#page-query-selector-all) and [Puppeteer](https://github.com/puppeteer/puppeteer/tree/7748730163bc1a14cbb30881809ea529844f887e?ref=serpapi.com#q-is-puppeteer-replacing-seleniumwebdriver) docs: "Puppeteer has event-driven architecture, which removes a lot of potential flakiness. There’s no need for evil “sleep(1000)” calls in puppeteer scripts.". Adding arbitrary delays between `page.$$` and `page.evaluate` is essentially the same as `sleep(1000)`, except with a shorter duration. More random delays, more flakiness. – ggorlen Aug 08 '23 at 16:32
  • `networkidle2`, same thing--it's basically a sleep for 500 ms after all except the last two requests end, a rough estimate of when elements will be ready, not a real predicate. Sleeping can have value from time to time and is useful for debugging, but it's a last resort move when you can't wait for something in a more deterministic, event-driven way. – ggorlen Aug 08 '23 at 16:34
  • See also [page.evaluate Vs. Puppeteer $ methods](https://stackoverflow.com/questions/55664420/page-evaluate-vs-puppeteer-methods). Answer is written by Thomas Dondorf, probably the most experienced Puppeteer writer and author of puppeteer-cluster, probably the best library in the Puppeteer ecosystem. In that answer, handles have some purpose, though, because of the trusted `.click()`. – ggorlen Aug 08 '23 at 16:37
  • BTW, Puppeteer is introducing [locators](https://pptr.dev/guides/locators), which will probably make handles, `$$` and `$$eval` all obsolete, if it's anything like Playwright. – ggorlen Aug 08 '23 at 17:11
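
The approach argued for in the comments above (wait for a concrete predicate, then select and read in a single `$$eval` call) could look roughly like the sketch below. `collectHrefs` is a hypothetical helper, the default `a[href]` selector is borrowed from the answer, and the de-duplication step is an addition for convenience. `page` is assumed to be a Puppeteer page that has already navigated to the target URL.

```javascript
// Sketch only: waits for at least one match, then reads every href in one call.
async function collectHrefs(page, selector = 'a[href]') {
  // Resolves as soon as at least one matching element is in the DOM.
  await page.waitForSelector(selector);

  // Selection and property access happen in one browser round trip,
  // so no ElementHandles are held between the two steps.
  const hrefs = await page.$$eval(selector, (anchors) =>
    anchors.map((a) => a.href)
  );

  // Drop duplicates while keeping first-seen order.
  return [...new Set(hrefs)];
}
```

If the links of interest live in a specific container, passing a narrower selector than `a[href]` makes the wait a more meaningful predicate, since almost every page has some anchor present from the start.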