I'm having trouble scraping items that don't share a single root element, which I believe x-ray requires.

Consider scraping Hacker News, where each headline is made up of two TRs:

<tbody>
  <tr class="athing">content item 1</tr>
  <tr>content item 1</tr>
  <tr class="spacer"></tr>
  <tr class="athing">content item 2</tr>
  <tr>content item 2</tr>
  <tr class="spacer"></tr>
</tbody>

As can be seen, there's no common root-node per item.

Does x-ray support scraping in such a case?

Geert-Jan

1 Answer

You could use the adjacent sibling combinator (+) to select the row that immediately follows each tr.athing:

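var Xray = require('x-ray')
var x = Xray()

// Match each headline row plus the row that immediately follows it
x(html, 'tbody', [
    'tr.athing, tr.athing + tr:not(.athing):not(.spacer)'
])(function (err, res) {
    console.log(res)
})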

Result:

[ 'content item 1a',
  'content item 1b',
  'content item 2a',
  'content item 2b' ]
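
The selector returns one flat list, so the two rows of each headline still have to be paired up afterwards. A minimal sketch of that step follows, assuming every tr.athing is followed by exactly one detail row. The chunk helper is hypothetical and not part of x-ray.

```javascript
// Hypothetical helper (not part of x-ray): splits a flat array into
// consecutive groups of `size` items.
function chunk(items, size) {
  const groups = []
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size))
  }
  return groups
}

// Flat result as returned by the selector above.
const flat = [
  'content item 1a',
  'content item 1b',
  'content item 2a',
  'content item 2b'
]

// Each headline becomes a [headlineRow, detailRow] pair.
const headlines = chunk(flat, 2)
console.log(headlines)
```

If a headline can be missing its second row, fixed-size chunking would misalign the pairs, and the rows would need to be told apart some other way.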
frustrum