Retrieving text from a website with goquery

Question

I have an HTML page that looks roughly like this:

<h4>Movies</h4>
    <h5><a href="external_link" target="_blank"> A Song For Jenny</a> (2015)</h5>
    Rating: PG<br/>
    Running Time (minutes): 77<br/>
    Description: This Drama, based on real life events, tells the story of a family affected directly by the 7/7 London bombings.  It shows love, loss, heartache and  ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=2713288">More about  A Song For Jenny</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=2713288">Edit  A Song For Jenny</a><br/>
    <br/>
    <h5><a href="link" target="_blank">#RealityHigh</a> (2017)</h5>
    Rating: PG<br/>
    Running Time (minutes): 99<br/>
    Description: High-achieving high-school senior Dani Barnes dreams of getting into UC Davis, the world's top  veterinary school. Then a glamorous new friend draws  ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=4089906">More about #RealityHigh</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=4089906">Edit #RealityHigh</a><br/>
    <br/>
    <h5><a href="link" target="_blank">1 Night</a> (2016)</h5>
    Rating: PG<br/>
    Running Time (minutes): 80<br/>
    Description: Bea, a worrisome teenager, reconnects with her introverted childhood friend, Andy. The two  overcome their differences in social status one night aft ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=3959071">More about 1 Night</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=3959071">Edit 1 Night</a><br/>
    <br/>
    <h5><a href="link" target="_blank">10 Cloverfield Lane</a> (2016)</h5>
    Rating: PG<br/>
    Running Time (minutes): 104<br/>
    Description: Soon after leaving her fiancé Michelle is involved in a car accident. She awakens
to find herself sharing an underground bunker with Howard and Emme ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=3052189">More about 10 Cloverfield Lane</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=3052189">Edit 10 Cloverfield Lane</a><br/>
    <br/>

I need to use goquery to get as much information out of this page as possible. I know how to extract the external links (replaced by the word "link" in this fragment) and the links to more details, but I also want to extract the information that exists only as text: the year (in the headings), the running time, the shortened description and the PG rating. I couldn't figure out how to do this in goquery because this text isn't wrapped in a div or any other tag. I tried finding the h5 tags and calling .Next() on them, but that only gave me the <br> tags, not the text in between. How can I do that? If there's a better way to do it than using goquery, I'm fine with that. My code looks like this:

// Retrieve the page count:
    res, err = http.Get("myUrlAddress")
    if err != nil {
        fmt.Println(err)
        os.Exit(-1)
    }
    doc, err = goquery.NewDocumentFromResponse(res)
    if err != nil {
        fmt.Println(err)
        os.Exit(-1)
    }
    links := doc.Find(`a[href*="pageIndex"]`)
    fmt.Println(links.Length()) // Output page count
    s := doc.Find("h5").First().Next() // I expect it to be the text after the heading.
    fmt.Println(s.Text()) // But it's empty, and if I check the node type it says br

Please include your current code, explain what problem you're having, and what you expected instead. — Jonathan Hall, Dec 21 '17 at 20:51
I think it will be hard, because the text that you want to extract is not in the `Document` node. Another option is using `regex` to extract it. — Dharma Saputra, Dec 22 '17 at 06:53

score 1 · Accepted Answer · answered Dec 27 '17 at 07:35

I don't much like the idea of using regex to parse HTML. It feels too fragile against minor changes such as a different tag order.

I think it is best to fall back on html.Node (from golang.org/x/net/html), which goquery is built on. The idea is to iterate over the heading's siblings until they run out or the next h5 is encountered. Dealing with links or other element tags may be a little awkward, because html.Node provides a rather unfriendly API for attributes, but switching back to goquery from it is even more trouble.

package main

import (
    "fmt"
    "github.com/PuerkitoBio/goquery"
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
    "os"
    "strings"
)

type Movie struct {
}

func (m Movie) addTitle(s string) {
    fmt.Println("Title", s)
}

func (m Movie) addProperty(s string) {
    if s == "" {
        return
    }
    fmt.Println("Property", s)
}

var M []*Movie

func parseMovie(i int, s *goquery.Selection) {
    m := &Movie{}
    m.addTitle(s.Text())

loop:
    for node := s.Nodes[0].NextSibling; node != nil; node = node.NextSibling {
        switch node.Type {
        case html.TextNode:
            m.addProperty(strings.TrimSpace(node.Data))
        case html.ElementNode:
            switch node.DataAtom {
            case atom.A:
                // Link: do something with it. You may want to switch back to goquery here.
                fmt.Println(node.Attr)
            case atom.Br:
                continue
            case atom.H5:
                break loop
            }
        }
    }

    M = append(M, m)
}

func main() {
    r, err := os.Open("movie.html")
    if err != nil {
        panic(err)
    }
    doc, err := goquery.NewDocumentFromReader(r)
    if err != nil {
        panic(err)
    }

    doc.Find("h5").Each(parseMovie)
}
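
The addTitle and addProperty methods above only print what they receive. As a rough sketch of how those strings could be turned into fields, the following uses only the standard library. The MovieInfo type, its field names, and the helpers setProperty and splitTitleYear are hypothetical and not part of the answer; the labels are taken from the HTML fragment in the question, and strings.Cut assumes Go 1.18 or newer.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// MovieInfo is a hypothetical record; the field names are assumptions.
type MovieInfo struct {
	Title          string
	Year           int
	Rating         string
	RuntimeMinutes int
	Description    string
}

// setProperty stores one "Label: value" line in m. It reports false when
// the line has no recognised label or the running time is not a number.
func (m *MovieInfo) setProperty(line string) bool {
	label, value, ok := strings.Cut(line, ":")
	if !ok {
		return false
	}
	value = strings.TrimSpace(value)
	switch strings.TrimSpace(label) {
	case "Rating":
		m.Rating = value
	case "Running Time (minutes)":
		n, err := strconv.Atoi(value)
		if err != nil {
			return false
		}
		m.RuntimeMinutes = n
	case "Description":
		m.Description = value
	default:
		return false
	}
	return true
}

// splitTitleYear splits heading text such as " A Song For Jenny (2015)"
// into the title and the year in the trailing parentheses.
func splitTitleYear(heading string) (string, int, bool) {
	heading = strings.TrimSpace(heading)
	open := strings.LastIndex(heading, "(")
	if open < 0 || !strings.HasSuffix(heading, ")") {
		return heading, 0, false
	}
	year, err := strconv.Atoi(heading[open+1 : len(heading)-1])
	if err != nil {
		return heading, 0, false
	}
	return strings.TrimSpace(heading[:open]), year, true
}

func main() {
	m := &MovieInfo{}
	m.Title, m.Year, _ = splitTitleYear(" A Song For Jenny (2015)")
	for _, line := range []string{
		"Rating: PG",
		"Running Time (minutes): 77",
		"Description: This Drama, based on real life events ...",
	} {
		m.setProperty(line)
	}
	fmt.Printf("%+v\n", *m)
}
```

Note that setProperty has a pointer receiver. The methods in the answer take Movie by value, which is fine while they only print, but assignments to fields made through a value receiver would be lost.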

maerics · Answer 2 · 2017-12-27T05:17:44.717

Unfortunately, because of how this HTML page is structured, goquery won't be of much help once you've identified the section of the page that contains the movie listings: the data points of interest are not isolated into elements that goquery can target.

However, the details can easily be parsed using regular expressions, which can of course be modified as needed (especially if/when the original page changes its HTML structure).

import (
    "regexp"
    "strconv"
)

type Movie struct {
    Title          string
    ReleaseYear    int
    Rating         string
    RuntimeMinutes int
    Description    string
}

var movieregexp = regexp.MustCompile(`` +
    `<h5><a.*?>\s*(.*?)\s*</a>\s*\((\d{4})\)</h5>` + // Title and release year
    `[\s\S]*?Rating: (.*?)<` +
    `[\s\S]*?Running Time \(minutes\): (\d{1,3})` +
    `[\s\S]*?Description: ([\s\S]*?)<`)

// Returns a slice of movies parsed from the given string, possibly empty.
func ParseMovies(s string) []Movie {
    movies := []Movie{}
    groups := movieregexp.FindAllStringSubmatch(s, -1)

    if groups != nil {
        for _, group := range groups {
            // We know these integers parse correctly because of the regex.
            year, _ := strconv.Atoi(group[2])
            runtime, _ := strconv.Atoi(group[4])
            // Append the new movie to the list.
            movies = append(movies, Movie{
                Title:          group[1],
                ReleaseYear:    year,
                Rating:         group[3],
                RuntimeMinutes: runtime,
                Description:    group[5],
            })
        }
    }

    return movies
}
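
One caveat with a single combined pattern: the lazy `[\s\S]*?` segments can run past the end of a listing, so a movie that lacks one of the fields may pick up values from the following movie or be dropped. A possible variation, sketched here with hypothetical names (splitListings, field, titleRe, ratingRe), is to cut the page at each <h5> first and then apply one small pattern per field to each piece. It assumes every listing starts with a literal <h5> tag without attributes, as in the fragment from the question.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical per-field patterns, each applied to one listing at a time.
var (
	titleRe  = regexp.MustCompile(`<a[^>]*>\s*(.*?)\s*</a>\s*\((\d{4})\)`)
	ratingRe = regexp.MustCompile(`Rating:\s*([^<]*)<`)
)

// splitListings cuts the page at each <h5> so that a pattern cannot run
// from one listing into the next. Text before the first <h5> is dropped.
func splitListings(page string) []string {
	parts := strings.Split(page, "<h5>")
	if len(parts) < 2 {
		return nil
	}
	return parts[1:]
}

// field returns capture group n of re in s, or "" when re does not match.
func field(re *regexp.Regexp, s string, n int) string {
	m := re.FindStringSubmatch(s)
	if m == nil || n >= len(m) {
		return ""
	}
	return strings.TrimSpace(m[n])
}

func main() {
	// The first listing deliberately has no "Rating:" line.
	page := `<h4>Movies</h4>
<h5><a href="link">First</a> (2015)</h5>
Running Time (minutes): 77<br/>
<h5><a href="link">Second</a> (2016)</h5>
Rating: PG<br/>`

	for _, listing := range splitListings(page) {
		fmt.Printf("title=%q year=%q rating=%q\n",
			field(titleRe, listing, 1),
			field(titleRe, listing, 2),
			field(ratingRe, listing, 1))
	}
}
```

With this layout a missing field simply comes back as an empty string for that listing, and the caller can decide whether to skip the record or keep it partially filled.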

That's what I've done in the end. I've used HTML nodes and worked on them directly (with a lot of hacking around the stupidities and quirks of the original website). My code is going to land on GitHub in a few days; I will post a link to it in the comments when that happens. — Mikołaj Hołysz, May 04 '18 at 19:19
