
I fetched a list (ul) from a website, and now I want to loop over its children and read their text, e.g.:

<ul>
  <li>
    <span>some text</span>
  </li>
  <li>
    <span>some text 2</span>
  </li>
  <li>
    <span>some text 3</span>
  </li>
  <li>
    <span>some text 4</span>
  </li>
</ul>

When I print the main node, it shows ChildNodeCount:4 Children:[]. The ChildNodeCount is correct, but Children is empty, so I cannot loop through the children to retrieve their text.

A page has multiple lists, so what I basically want is a slice of UL elements, so that I can loop through each UL element and, within it, through its LI children.

Does anyone know what I am doing wrong?

chromedp.Nodes(`.product-item__content ul.product-small-specs`, &specs, chromedp.AtLeast(0)),

Also a small side question: if I have a slice of strings (URLs) and want to crawl them one by one, how would I do that? To put it another way: if I go to page "A" and find 20 links, how can I automatically visit those links too, and then visit any links found on those pages as well?

I tried this code which results in an error:

exception "Uncaught" (1:54): TypeError: this.getClientRects is not a function at Text.text (:2:55)

maxGoroutines := 1
guard := make(chan struct{}, maxGoroutines)

for i := range links {
    guard <- struct{}{}

    go func(n int) {
        // Use the parameter n rather than the captured loop variable i.
        retrieveDetails("https://www.bol.com" + links[n].AttributeValue("href"))
        time.Sleep(5 * time.Second)

        <-guard
    }(i)
}

func retrieveDetails(url string) {
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.Flag("headless", false),
    )
    actx, acancel := chromedp.NewExecAllocator(context.Background(), opts...)
    defer acancel()

    ctx, cancel := chromedp.NewContext(
        actx,
        chromedp.WithLogf(log.Printf),
    )
    defer cancel()

    ctx, cancel = context.WithTimeout(ctx, 6000*time.Second)
    defer cancel()

    var header string

    err := chromedp.Run(ctx,
        emulation.SetUserAgentOverride("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"),
        chromedp.ResetViewport(),
        chromedp.Navigate(url),
        chromedp.Sleep(1*time.Second),
        chromedp.Click("#js-first-screen-accept-all-button"),
        chromedp.WaitVisible(`.product-image`),
        chromedp.Text("h1", &header, chromedp.AtLeast(0)),
        chromedp.Stop(),
    )

    fmt.Println(header)

    if err != nil {
        fmt.Println(err)
    }
}

1 Answer

It seems that this post asks several unrelated questions.

chromedp says node has children but none are there

I will quote a comment from https://github.com/chromedp/chromedp/issues/632#issuecomment-654213589 :

Nodes are only obtained from the browser on an on-demand basis. If we always held the entire DOM node tree in memory, our CPU and memory usage in Go would be far higher.

See also https://github.com/chromedp/chromedp/issues/761.

TypeError: this.getClientRects is not a function at Text.text

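The default query option is chromedp.BySearch, which can also return nodes that are not DOM elements (for example text nodes). Such nodes do not have element methods like getClientRects, which is why the error occurs. Since you provide a CSS selector and want to select a DOM element, you can change the code like this: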

- chromedp.WaitVisible(`.product-image`),
+ chromedp.WaitVisible(`.product-image`, chromedp.ByQuery),
Zeke Lu
  • Yeah, I figured that, but is there no way to get them anyway, scoped to the UL? If I fetch the LI elements separately, I cannot match them to their parent, since every list has a different number of LIs. I will check those two links, thank you very much! Then only my last question remains, about crawling multiple URLs. – Angelo van Cleef Nov 25 '22 at 10:54
  • 1. You can use `chromedp.Evaluate` to execute a JavaScript expression and retrieve the elements as JSON. 2. Crawling multiple URLs is a big question. One option is to read URLs from a chan, crawl each one, and send newly found URLs into that chan. – Zeke Lu Nov 25 '22 at 11:18
  • Yeah, I tried that, but I keep getting an error. I will update the post with it soon. Thanks so much for the tip to use Evaluate, I will give it a go! – Angelo van Cleef Nov 25 '22 at 11:48
  • Updated my question with the error and my current code that produces the error – Angelo van Cleef Nov 25 '22 at 12:08
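
The channel-based crawling idea from the comments could look roughly like the sketch below. It is not the only way to do it, and all names are assumptions made for illustration: crawl is a hypothetical function, and its fetch parameter stands in for the real page visit (for example a chromedp.Run call that collects href values). An in-memory map plays the role of the website here.

```go
package main

import "fmt"

// crawl visits start and every page reachable from it exactly once, with at
// most workers pages being fetched at the same time. fetch is a hypothetical
// stand-in for the real page visit and returns the links found on a page.
// The result lists the URLs in the order they were discovered.
func crawl(start string, workers int, fetch func(url string) []string) []string {
	jobs := make(chan string)    // URLs handed to the workers
	found := make(chan []string) // links reported back by the workers

	for i := 0; i < workers; i++ {
		go func() {
			for u := range jobs {
				found <- fetch(u)
			}
		}()
	}

	seen := map[string]bool{start: true}
	order := []string{start}
	queue := []string{start}
	pending := 0 // fetches in flight whose result has not been received yet

	for len(queue) > 0 || pending > 0 {
		// A nil channel is never selected, so a job is only offered
		// while the queue is non-empty.
		var send chan string
		var next string
		if len(queue) > 0 {
			send, next = jobs, queue[0]
		}
		select {
		case send <- next:
			queue = queue[1:]
			pending++
		case links := <-found:
			pending--
			for _, l := range links {
				if !seen[l] {
					seen[l] = true
					order = append(order, l)
					queue = append(queue, l)
				}
			}
		}
	}
	close(jobs) // lets the workers exit
	return order
}

func main() {
	// Made-up link graph standing in for real pages.
	site := map[string][]string{
		"/a": {"/b", "/c"},
		"/b": {"/a", "/d"},
		"/c": {"/d"},
	}
	visited := crawl("/a", 2, func(u string) []string { return site[u] })
	fmt.Println(len(visited), "pages visited:", visited)
}
```

The seen map prevents visiting a page twice, and the pending counter lets the loop stop once the queue is empty and no fetch is still running. A real crawler would likely also want a depth or page limit and a delay between requests.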