How to extract the text of a custom html tag with goquery?

Question

I am trying to extract the text of a custom HTML tag (<prelogin-cookie>):

someHtml := `<html><body>Login Successful!</body><!-- <saml-auth-status>1</saml-auth-status><prelogin-cookie>4242424242424242</prelogin-cookie><saml-username>my-username</saml-username><saml-slo>no</saml-slo> --></html>`
query, _ := goquery.NewDocumentFromReader(strings.NewReader(someHtml))
sel := query.Find("prelogin-cookie")
println(sel.Text())

But it returns only an empty string. How can I get the actual text of that HTML tag, i.e. 4242424242424242?

Try checking the error returned as the second value of `goquery.NewDocumentFromReader`. The parse process is probably failing — novalagung, Dec 19 '19 at 09:37

icza · Accepted Answer · 2019-12-19T13:52:45.563

<prelogin-cookie> is not found because it's inside an HTML comment.

The content of your comment is itself a series of XML or HTML tags, so it can be parsed as HTML if you use the comment text as the input document.
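To illustrate that the comment body is just a flat run of tags, here is a minimal sketch that reads it with the standard library's encoding/xml instead of goquery. This is not one of the solutions below: it assumes the comment text is well-formed XML made of sibling elements with no nesting, and `parseFlatTags` is a hypothetical helper name chosen for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// parseFlatTags (hypothetical helper) reads a fragment of sibling elements,
// such as <a>1</a><b>2</b>, and returns a map from element name to its text.
// Assumption: the fragment is well-formed XML and the elements are not nested.
func parseFlatTags(fragment string) (map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(fragment))
	out := make(map[string]string)
	current := "" // name of the element we are inside, if any
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			current = t.Name.Local
		case xml.CharData:
			// Text outside any element (e.g. surrounding spaces) is ignored.
			if current != "" {
				out[current] += string(t)
			}
		case xml.EndElement:
			current = ""
		}
	}
}

func main() {
	// The comment body from the question, with the <!-- and --> already removed.
	body := ` <saml-auth-status>1</saml-auth-status><prelogin-cookie>4242424242424242</prelogin-cookie><saml-username>my-username</saml-username><saml-slo>no</saml-slo> `
	tags, err := parseFlatTags(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(tags["prelogin-cookie"])
}
```

Unlike an HTML parser, encoding/xml rejects malformed markup, so this only suits input you know to be strict.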

Warning. Only the first solution below handles "all" HTML documents properly. The other solutions are simpler and will also handle your case just fine, but they might not handle some edge cases. Decide whether they are worth using for you.

1. By searching the HTML node tree

One way to find and extract the comment would be to traverse the HTML node tree and look for a node with type html.CommentNode.

For this, we'll use a recursive helper function to traverse a node tree:

func findComment(n *html.Node) *html.Node {
    if n == nil {
        return nil
    }
    if n.Type == html.CommentNode {
        return n
    }
    if res := findComment(n.FirstChild); res != nil {
        return res
    }
    if res := findComment(n.NextSibling); res != nil {
        return res
    }
    return nil
}

And using it:

doc, err := goquery.NewDocumentFromReader(strings.NewReader(someHtml))
if err != nil {
    panic(err)
}

var comment *html.Node
for _, node := range doc.Nodes {
    if comment = findComment(node); comment != nil {
        break
    }
}
if comment == nil {
    fmt.Println("no comment")
    return
}

doc, err = goquery.NewDocumentFromReader(strings.NewReader(comment.Data))
if err != nil {
    panic(err)
}

sel := doc.Find("prelogin-cookie")
fmt.Println(sel.Text())

This will print (try it on the Go Playground):

4242424242424242

2. With `strings`

If you just have to handle the "document at hand", a simpler solution may be to use the strings package to find the start and end indices of the comment:

start := strings.Index(someHtml, "<!--")
if start < 0 {
    panic("no comment")
}
end := strings.Index(someHtml[start:], "-->")
if end < 0 {
    panic("no comment")
}

And using this as the input:

// end is relative to someHtml[start:], so add start back to get an absolute index.
doc, err := goquery.NewDocumentFromReader(strings.NewReader(someHtml[start+4 : start+end]))
if err != nil {
    panic(err)
}

sel := doc.Find("prelogin-cookie")
fmt.Println(sel.Text())

This will output the same (try it on the Go Playground).
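The index arithmetic above is easy to get wrong, because the second index is measured from the start of the comment and not from the start of the document. As a minimal sketch, assuming Go 1.18 or later, strings.Cut avoids the offsets entirely. `firstComment` is a hypothetical helper name, and like the code above it only looks at the first comment in the document.

```go
package main

import (
	"fmt"
	"strings"
)

// firstComment (hypothetical helper) returns the text between the first
// "<!--" and the next "-->". The boolean is false when the input has no
// complete comment.
func firstComment(s string) (string, bool) {
	// Drop everything up to and including the opening delimiter.
	_, rest, found := strings.Cut(s, "<!--")
	if !found {
		return "", false
	}
	// Keep everything before the closing delimiter.
	body, _, found := strings.Cut(rest, "-->")
	if !found {
		return "", false
	}
	return body, true
}

func main() {
	someHtml := `<html><body>Login Successful!</body><!-- <saml-auth-status>1</saml-auth-status><prelogin-cookie>4242424242424242</prelogin-cookie> --></html>`
	body, ok := firstComment(someHtml)
	if !ok {
		fmt.Println("no comment")
		return
	}
	// body can now be passed to goquery.NewDocumentFromReader via strings.NewReader.
	fmt.Printf("%q\n", body)
}
```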

3. Using `regexp`

A simpler (but less efficient) alternative to the previous solution is to use the regexp package to get the comment out of the original document:

comments := regexp.MustCompile(`<!--(.*?)-->`).FindAllString(someHtml, -1)
if len(comments) == 0 {
    fmt.Println("no comment")
    return
}

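// Strip the leading "<!--" (4 bytes) and the trailing "-->" (3 bytes).
doc, err := goquery.NewDocumentFromReader(strings.NewReader(
    comments[0][4 : len(comments[0])-3]))
if err != nil {
    panic(err)
}

sel := doc.Find("prelogin-cookie")
fmt.Println(sel.Text())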

Try this one on the Go Playground.
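A variation on the regexp approach, as a sketch: FindStringSubmatch returns the capture group directly, so the delimiters need no manual slicing, and the `(?s)` flag lets `.` match newlines for comments that span several lines. `commentRe` and `commentBody` are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentRe captures the body of an HTML comment in group 1.
// (?s) makes "." match newlines; ".*?" stops at the first "-->".
var commentRe = regexp.MustCompile(`(?s)<!--(.*?)-->`)

// commentBody (hypothetical helper) returns the body of the first comment
// in s. The boolean is false when s contains no complete comment.
func commentBody(s string) (string, bool) {
	m := commentRe.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	someHtml := "<html><body>Login Successful!</body><!-- <prelogin-cookie>4242424242424242</prelogin-cookie>\n<saml-slo>no</saml-slo> --></html>"
	body, ok := commentBody(someHtml)
	if !ok {
		fmt.Println("no comment")
		return
	}
	// body can now be passed to goquery.NewDocumentFromReader via strings.NewReader.
	fmt.Printf("%q\n", body)
}
```

Compiling the pattern once at package level avoids recompiling it on every call.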

I see! Thanks for your answer, so as far as I understand there is no way to achieve that in one shot — Natalie Perret, Dec 19 '19 at 09:43
@EhouarnPerret if you just want a one-liner, you could probably do `goquery.NewDocumentFromReader(strings.NewReader(strings.ReplaceAll(strings.ReplaceAll(someHtml, "<!--", ""), "-->", "")))` — dave, Feb 16 '20 at 10:16
