
I am learning Google's Go programming language. Does anyone know the best practice for extracting all URLs from an HTML web page?

Coming from the Java world, there are libraries to do the job, for example jsoup, htmlparser, etc. But for Go, I guess no similar library is available yet?

Jonathan Hall
Jifeng Zhang

6 Answers


If you know jQuery, you'll love GoQuery.

Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based on the html package in the go.net repository. (Okay, so it's higher-level than just a parser, as it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)

Matt

Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third-party package you might try, though, is go-html-transform. It is being actively maintained.

Sonia
    I can't find an example anywhere on how to use this library for scraping and don't find it obvious from the docs. Could anyone point me to an example? – kristaps Mar 03 '14 at 09:32
  • Is it planned to include this package natively in Go? – Kiril Apr 03 '14 at 14:39
    The HTML package is now available. Read the documentation here: https://godoc.org/golang.org/x/net/html – R4chi7 May 10 '16 at 15:09

While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.

Its sources were at code.google.com/p/go.net/html and are now at github.com/golang/net, and it is being actively developed.

It is mentioned in this recent go-nuts discussion.


Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package is now golang.org/x/net (see godoc).

VonC
  • The Go html package has moved to the [go.net](https://code.google.com/p/go/source/browse?repo=net#hg%2Fhtml) repo. [Here](http://godoc.org/code.google.com/p/go.net/html) is the documentation. – ctn May 10 '13 at 15:38
  • @ctn thank you for the update. Not sure why your edit was rejected: I have restored it in the answer. – VonC May 11 '13 at 00:13
  • Thanks. They said it would change the original meaning too much and I'd better leave a comment instead. – ctn May 13 '13 at 09:18

I've searched around and found that there is a library called Gokogiri, which sounds like Nokogiri for Ruby. I think the project is active too.

Ye Lin Aung

I just published an open source, event-based, HTML 5.0-compliant parsing package for Go. You can find it here

Here is the sample code to get all the links from a page (from A elements):

links := []string{}

parser := NewParser(htmlContent)

parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
    if e.TagName == "a" {
        link, _ := e.GetAttributeValue("href")
        if link != "" {
            links = append(links, link)
        }
    }
}, nil)

A few things to keep in mind:

  • These are relative links, not full URLs
  • Dynamically generated links will not be collected
  • There are other links not being collected (META tags, images, iframes, etc.). It's pretty easy to modify this code to collect those.
Marcelo Calbucci

You may also use Colly (documentation); it is usually used for web scraping.

Features

  1. Clean API
  2. Fast (>1k request/sec on a single core)
  3. Manages request delays and maximum concurrency per domain
  4. Automatic cookie and session handling
  5. Sync/async/parallel scraping
  6. Distributed scraping
  7. Caching
  8. Automatic encoding of non-unicode responses
  9. Robots.txt support
  10. Google App Engine support
import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}