
I am scraping a website and realised that the content I need doesn't load onto the page until a click event is triggered on an element.

Since I didn't build the website, I have no idea how the page content actually gets updated, other than by clicking on the element.

Currently, my steps are:

  1. make a GET request to the url
  2. load the response data into cheerio

I wonder if I can do something like,

  1. make a GET request to the url endpoint,
  2. trigger a click event with jquery or cheerio on the element somehow,
  3. then reload it into cheerio.

(note: the url doesn't change after the element is clicked)

Eddie Lam
  • While cheerio is good for scraping static website content, I think it would be better to use Puppeteer or another headless solution, since you are dealing with an SPA. Clicking and navigating through such an app is always easier with a headless browser. – Abrar Hossain Feb 13 '21 at 02:58
  • Nice. I came across JSDOM as well and am wondering what the difference between the two modules is, and which is better? – Eddie Lam Feb 13 '21 at 07:57
  • I have not used JSDOM before, but I looked at the GitHub repo. It seems to be a set of separate modular implementations of various browser components. While you can use it to do what you want in this case, it seems to me that a lot of setup would be needed to make it behave like a headless browser. My experience with Puppeteer has been great. It's very easy to load a site and start injecting plain JS code to interact with the application. My only complaint is that Puppeteer is memory intensive and really does not work well on slow machines. – Abrar Hossain Feb 13 '21 at 08:39
  • Word. I am trying to click an element and it's taking very long for the waitForNavigation function to finish. I have set the timeout option to 0 to allow for a longer load time, but it's taking what feels like forever. Any experience you can share for dealing with such a scenario? – Eddie Lam Feb 13 '21 at 08:47
  • `page.waitForNavigation` takes a single options argument (it is `page.waitForSelector` that takes a selector and options). For options, try passing `{timeout: 30000, waitUntil: 'networkidle2'}`. This sets the timeout to 30s and also waits for network connections to become nearly idle. If the click never triggers a navigation, or a selector you are waiting on is never found, the wait gets stuck until the timeout. You can use a "sleep" to get past this (I have done it too many times!). E.g.: `async function sleep(ms) { await new Promise(r => setTimeout(r, ms)); }`. Then you can use it like this: `await sleep(5000)` to wait for 5s. Usually, a 5-10s delay after a click or other event suffices. – Abrar Hossain Feb 13 '21 at 08:55
  • Cheerio is a static HTML parser that doesn't run JS. – ggorlen Nov 26 '22 at 02:25
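
The approach suggested in the comments above could look roughly like the following. This is only a minimal sketch, assuming the `puppeteer` and `cheerio` packages are installed; the function names `clickAndGetHtml` and `scrape`, and whatever selectors are passed to them, are hypothetical and not taken from the original post. Because the URL does not change after the click, the sketch waits for a selector to appear rather than for a navigation.

```javascript
// Sketch: click an element in a headless browser, wait for the new content,
// then return the rendered HTML so it can be parsed (e.g. with cheerio).
// `page` is expected to be a Puppeteer Page; the selectors are hypothetical.
async function clickAndGetHtml(page, clickSelector, contentSelector, timeout = 30000) {
  // Make sure the clickable element exists before clicking it.
  await page.waitForSelector(clickSelector, { timeout });
  await page.click(clickSelector);

  // The URL does not change, so wait for the new content to appear
  // instead of waiting for a navigation.
  await page.waitForSelector(contentSelector, { timeout });

  // Serialized HTML of the page once the content selector has appeared.
  return page.content();
}

async function scrape(url, clickSelector, contentSelector) {
  // Required inside the function so the helper above works with any
  // Puppeteer-compatible page object; both packages must be installed
  // before calling scrape().
  const puppeteer = require('puppeteer');
  const cheerio = require('cheerio');

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    const html = await clickAndGetHtml(page, clickSelector, contentSelector);
    return cheerio.load(html); // query the result with $('...') as before
  } finally {
    // Always close the browser, even if a wait times out.
    await browser.close();
  }
}
```

Waiting on a selector tends to be more reliable than a fixed sleep, since it resolves as soon as the content is present and fails with a timeout error if it never shows up.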
