
I am learning Google's Go programming language. Does anyone know the best practice for extracting all URLs from an HTML web page?

Coming from the Java world, there are libraries to do the job, for example jsoup, htmlparser, etc. But for Go, I guess no similar library is available yet?

Jonathan Hall
Jifeng Zhang

6 Answers


If you know jQuery, you'll love GoQuery.

Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based on the html package in the go.net repository. (Okay, so it's higher-level than just a parser, as it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)

Matt

Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third-party package you might try, though, is go-html-transform. It is being actively maintained.

Sonia
    I can't find an example anywhere on how to use this library for scraping and don't find it obvious from the docs. Could anyone point me to an example? – kristaps Mar 03 '14 at 09:32
  • Is it planned to include this package natively in Go? – Kiril Apr 03 '14 at 14:39
    The HTML package is now available. Read the documentation here: https://godoc.org/golang.org/x/net/html – R4chi7 May 10 '16 at 15:09

While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.

Its sources were at code.google.com/p/go.net/html and are now at github.com/golang/net, and it is being actively developed.

It is mentioned in this recent go-nuts discussion.


Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package is now golang.org/x/net (see godoc).

VonC
  • The Go html package has moved to the [go.net](https://code.google.com/p/go/source/browse?repo=net#hg%2Fhtml) repo. [Here](http://godoc.org/code.google.com/p/go.net/html) is the documentation. – ctn May 10 '13 at 15:38
  • @ctn thank you for the update. Not sure why your edit was rejected: I have restored it in the answer. – VonC May 11 '13 at 00:13
  • Thanks. They said it would change the original meaning too much and I'd better leave a comment instead. – ctn May 13 '13 at 09:18

I've searched around and found that there is a library called Gokogiri, which sounds like Nokogiri for Ruby. I think the project is active too.

Ye Lin Aung

I just published an open source, event-based, HTML 5.0-compliant parsing package for Go. You can find it here

Here is the sample code to get all the links from a page (from A elements):

links := []string{}

parser := NewParser(htmlContent)

parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
    if e.TagName == "a" {
        link, _ := e.GetAttributeValue("href")
        if link != "" {
            links = append(links, link)
        }
    }
}, nil)

A few things to keep in mind:

  • These are relative links, not full URLs
  • Dynamically generated links will not be collected
  • There are other links not being collected (META tags, images, iframes, etc.). It's pretty easy to modify this code to collect those.
Marcelo Calbucci

You may also use Colly (documentation); it is usually used for web scraping.

Features

  1. Clean API
  2. Fast (>1k request/sec on a single core)
  3. Manages request delays and maximum concurrency per domain
  4. Automatic cookie and session handling
  5. Sync/async/parallel scraping
  6. Distributed scraping
  7. Caching
  8. Automatic encoding of non-unicode responses
  9. Robots.txt support
  10. Google App Engine support
import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}