
I was writing a web scraping application using cheerio.js. Things were going well until I noticed that cheerio's $('tbody tr') returns nothing, while jQuery's $('tbody tr') returns all the rows in the table body when I open the same website in Chrome. In the DOM that cheerio builds, there is no tbody. The structure is like <table><thead></thead><tr></tr><tr></tr></table>. Did Chrome make this change? Did cheerio parse the HTML response incorrectly?

Nam Thai
  • Yes, Chrome made this change. Cheerio operates on source code while jQuery in Chrome operates on the source code's view. Two different DOMs – xmojmr Aug 17 '15 at 04:54
  • @xmojmr can you explain a bit more please? What is the name, type, or category of each DOM? I just want to be aware of all the discrepancies for future reference. – Nam Thai Aug 17 '15 at 05:29

1 Answer


The following three HTML snippets look the same when rendered by a browser, yet their source code differs slightly.

  1. no thead no tbody in source code

    <table><tr><td>row1</td></tr><tr><td>row2</td></tr></table>
  2. no tbody in source code

    <table><thead></thead><tr><td>row1</td></tr><tr><td>row2</td></tr></table>
  3. tbody and no thead in source code

    <table><tbody><tr><td>row1</td></tr><tr><td>row2</td></tr></tbody></table>

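According to w3schools.com, browsers can use the thead, tbody, and tfoot elements to enable scrolling of the table body independently of the header and footer.

Browsers can also optimize, normalize, or modify the DOM before using it for display, as long as the resulting DOM renders as intended. In particular, a browser's HTML parser wraps tr elements that appear directly inside a table in an implied tbody, which is why Chrome shows a tbody that is not in the source code.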

In your case, cheerio parses the source code (the result of the node.js request) as-is and creates its own in-memory DOM representation, which you can traverse or modify later.

jQuery, when run in the browser, instead reads the normalized and optimized DOM that the browser has already parsed and processed.
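To make the difference concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not cheerio's or Chrome's actual implementation: `el`, `normalizeTable`, and `select` are hypothetical helpers invented for illustration, and the normalization only covers the implied-tbody case discussed here.

```javascript
// Toy DOM node: { tag, children }. Hypothetical helper, not a cheerio or browser API.
function el(tag, ...children) {
  return { tag, children };
}

// Very rough imitation of browser normalization: <tr> elements that sit
// directly under <table> are wrapped in an implied <tbody>.
function normalizeTable(node) {
  const children = node.children.map(normalizeTable);
  if (node.tag !== 'table') {
    return { tag: node.tag, children };
  }
  const result = [];
  let implied = null; // the implied <tbody> currently collecting rows
  for (const child of children) {
    if (child.tag === 'tr') {
      if (!implied) {
        implied = el('tbody');
        result.push(implied);
      }
      implied.children.push(child);
    } else {
      implied = null;
      result.push(child);
    }
  }
  return { tag: 'table', children: result };
}

// Toy equivalent of a descendant selector: select(root, ['tbody', 'tr'])
// plays the role of $('tbody tr').
function select(root, path) {
  const matches = [];
  // `index` is the position in `path` that still has to be matched.
  function walk(node, index) {
    let next = index;
    if (node.tag === path[index]) {
      if (index === path.length - 1) {
        matches.push(node);
      } else {
        next = index + 1;
      }
    }
    for (const child of node.children) walk(child, next);
  }
  walk(root, 0);
  return matches;
}

// Same structure as in the question: a thead followed by bare rows.
const source = el('table', el('thead'), el('tr', el('td')), el('tr', el('td')));
const asParsed = source;                   // tree kept exactly as written in the source
const asRendered = normalizeTable(source); // tree after browser-style normalization

console.log(select(asParsed, ['tbody', 'tr']).length);   // 0
console.log(select(asRendered, ['tbody', 'tr']).length); // 2
console.log(select(asParsed, ['table', 'tr']).length);   // 2
console.log(select(asRendered, ['table', 'tr']).length); // 2
```

The last two lines suggest a practical workaround: a selector that does not depend on tbody, such as 'table tr', matches the rows in both trees.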

While the two DOMs may differ, they look the same when presented to the user, so it is not a bug, it is a feature.

xmojmr