I am trying to use Puppeteer to scrape a website that loads its content dynamically. I've set it up in headless mode and even added a wait time to ensure that the content gets loaded, but I'm still not able to retrieve the desired dynamically loaded content.
Here's a snippet of my code:
const puppeteer = require('puppeteer');

async function getLinksFromBase(category) {
  const urlBase = 'https://xepelin.com/blog/';
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();
  await page.goto(urlBase + category);
  // Wait 5 seconds to make sure everything has loaded
  await new Promise(r => setTimeout(r, 5000));
  const hrefs = await page.$$eval('a', anchors => {
    return anchors.map(anchor => anchor.href).filter(href => !!href);
  });
  await browser.close();
  return hrefs;
}
When I execute the function, the resulting hrefs don't include the links from the dynamically loaded content.
I've tried waiting for specific selectors using waitForSelector. I've also checked for redirections and pop-ups, but I don't think there are any.
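For reference, this is roughly the waitForSelector variant I tried, wrapped in a helper. The 'a.blog-card' selector is just a guess on my part; I don't know the page's actual markup:

```javascript
// Wait for a post-link selector, then collect all hrefs on the page.
// NOTE: 'a.blog-card' is a placeholder selector, not confirmed markup.
async function getHrefsAfterWait(page, selector = 'a.blog-card') {
  await page.waitForSelector(selector, { timeout: 10000 });
  return page.$$eval('a', anchors =>
    anchors.map(a => a.href).filter(Boolean)  // drop empty hrefs
  );
}
```

Even with this, the hrefs I get back look the same as before.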
I've also tried to emulate network conditions and checked console logs for any errors, but couldn't find any clues.
The page I'm trying to scrape is https://xepelin.com/blog/empresarios-exitosos.
I'm also facing the problem that I have to click the "Load more" ("Cargar más") button repeatedly to get all the blog posts, so I can collect all the data I want.
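In case it helps, this is the kind of click loop I've been sketching for that button. Finding the button by its "Cargar más" text, and the 1.5 s pause after each click, are assumptions on my part, not something I've confirmed works on this page:

```javascript
// Click the "Cargar más" button until it disappears (or maxClicks is hit).
// ASSUMPTION: the button is a <button> whose text contains 'Cargar más'.
async function clickLoadMoreUntilGone(page, maxClicks = 20, delayMs = 1500) {
  for (let i = 0; i < maxClicks; i++) {
    // Look up the button by its visible text in the page context.
    const handle = await page.evaluateHandle(() =>
      [...document.querySelectorAll('button')]
        .find(b => b.textContent.includes('Cargar más')) || null
    );
    const button = handle.asElement();
    if (!button) break;                           // no button left: all posts loaded
    await button.click();
    await new Promise(r => setTimeout(r, delayMs)); // give new posts time to render
  }
}
```

I'd call this right after page.goto and before collecting the hrefs, but I'm not sure this is the right approach.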
Has anyone faced a similar issue, or can you point out what I might be missing? Any guidance would be much appreciated! This is my first time using Puppeteer to scrape data; I've previously been using Python.
Thank you.