I've been using colly for some simple web scraping tasks. It works fine in most cases where the page layouts are consistent or the logic is simple (e.g. a lot of existing examples and projects amount to "here's how you find the second table").
I'm trying to do more context-aware scraping in order to enrich the results. For example, take the following representative page layout:
<h2>Beans Table</h2>
<table class="myTableClass">
<tr> <td></td><td></td><td></td><td></td> </tr>
<tr> <td></td><td></td><td></td><td></td> </tr>
<tr> <td></td><td></td><td></td><td></td> </tr>
</table>
<h2>Rice Table</h2>
<table class="myTableClass">
<tr> <td></td><td></td><td></td><td></td> </tr>
</table>
If I wanted to grab every cell in all myTableClass tables, I could do something like this:
c.OnHTML(".myTableClass tr", func(e *colly.HTMLElement) {
    // e.DOM is the goquery selection for the matched <tr>.
    goquerySelection := e.DOM
    goquerySelection.Find("td").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("%d, Cell value: %s\n", i, s.Text())
    })
})
Or, if I wanted to find the headings above the tables, I could do this:
c.OnHTML("h2", func(e *colly.HTMLElement) {
    if strings.Contains(e.Text, "Beans") {
        log.Println("Beans table follows")
        // e.DOM is the goquery selection for this <h2>; Html() returns its inner HTML.
        html, err := e.DOM.Html()
        if err == nil {
            log.Println(html)
        }
    }
})
But I don't see an easy way to correlate "this table is under this heading". The index values and similar fields that colly exposes are all relative to the set of matched elements after parsing, and the goquery APIs also look slanted towards "iterate all of these tags for me".
I have a partial solution right now: pulling colly.HTMLElement.DOM.Html() as part of the request/initialization and trying to map positional awareness there using string matching. That doesn't seem very clean, though. Is there a supported way to maintain positional awareness when iterating over a webpage?
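To make the goal concrete, here is a minimal sketch of the correlation I'm after, written against only the Go standard library rather than colly or goquery. It walks the markup in document order and tags each row with the most recent h2 seen before it. This is an illustration under assumptions, not a proposed solution: encoding/xml only copes here because the sample is well-formed, and the names Row, rowsByHeading and samplePage, as well as the cell values, are made up for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Row is a hypothetical result type: one table row plus the heading it sits under.
type Row struct {
	Heading string
	Cells   []string
}

// samplePage mirrors the layout above; the cell values are invented.
const samplePage = `<h2>Beans Table</h2>
<table class="myTableClass">
<tr> <td>pinto</td><td>1</td> </tr>
<tr> <td>black</td><td>2</td> </tr>
</table>
<h2>Rice Table</h2>
<table class="myTableClass">
<tr> <td>basmati</td><td>3</td> </tr>
</table>`

// rowsByHeading reads tokens in document order and attaches to every <tr>
// the text of the last <h2> that closed before it.
func rowsByHeading(page string) ([]Row, error) {
	// Wrap the fragment in a single root element so it parses as one document.
	dec := xml.NewDecoder(strings.NewReader("<root>" + page + "</root>"))
	var (
		rows    []Row
		heading string          // text of the most recent <h2>
		current *Row            // row currently being collected, if any
		text    strings.Builder // character data of the open <h2> or <td>
	)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "tr":
				current = &Row{Heading: heading}
			case "h2", "td":
				text.Reset()
			}
		case xml.CharData:
			text.Write(t)
		case xml.EndElement:
			switch t.Name.Local {
			case "h2":
				heading = strings.TrimSpace(text.String())
			case "td":
				if current != nil {
					current.Cells = append(current.Cells, strings.TrimSpace(text.String()))
				}
			case "tr":
				if current != nil {
					rows = append(rows, *current)
					current = nil
				}
			}
		}
	}
}

func main() {
	rows, err := rowsByHeading(samplePage)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("%s: %v\n", r.Heading, r.Cells)
	}
}
```

What I'm hoping for is a way to get the same "last heading seen" association from inside colly's callbacks, without re-parsing the page myself.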