I am building a web crawler application in Go.

After downloading the HTML of a page, I extract the URLs. Some of them contain a "#", such as "en.wikipedia.org/wiki/Race_condition#Computing". I would like to drop the "#" and everything after it, since these URLs lead to the same page anyway. Any advice on how to do so?

2 Answers

Use the url package:

u, _ := url.Parse("SOME_URL_HERE")
u.Fragment = ""
return u.String()
Luke Joshua Park

An improvement on the answer by Luke Joshua Park is to parse the link relative to the URL of the source page. This produces an absolute URL from what might be a relative URL on the page (scheme not specified, host not specified, or a relative path). Another improvement is to check and handle errors.

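import "net/url"

// clean resolves linkURL against pageURL and returns the absolute URL
// with any fragment removed.
func clean(pageURL, linkURL string) (string, error) {
    p, err := url.Parse(pageURL)
    if err != nil {
        return "", err
    }
    l, err := p.Parse(linkURL) // resolve the link relative to the page
    if err != nil {
        return "", err
    }
    l.Fragment = "" // chop off the fragment
    return l.String(), nil
}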
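
As a rough, self-contained sketch of how resolving against the page URL behaves for different kinds of link, the program below runs a few sample links through the same steps. The name resolveAndStrip, the page URL, and the sample links are made up for illustration; only url.Parse, (*URL).Parse, the Fragment field, and String come from net/url.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAndStrip (hypothetical name) resolves linkURL against pageURL
// and returns the absolute URL without its fragment.
func resolveAndStrip(pageURL, linkURL string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	// (*URL).Parse parses linkURL in the context of base.
	link, err := base.Parse(linkURL)
	if err != nil {
		return "", err
	}
	link.Fragment = ""
	return link.String(), nil
}

func main() {
	page := "https://en.wikipedia.org/wiki/Race_condition" // assumed sample page
	links := []string{
		"#Computing",              // fragment only
		"/wiki/Mutex#Types",       // path relative to the host
		"//example.com/a#b",       // scheme not specified
		"https://example.org/x#y", // already absolute
	}
	for _, l := range links {
		got, err := resolveAndStrip(page, l)
		if err != nil {
			fmt.Println("skipping", l, ":", err)
			continue
		}
		fmt.Println(got)
	}
}
```

A fragment-only link such as "#Computing" resolves to the page's own URL, so a crawler that tracks visited pages would see it as already fetched.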

If you are not interested in getting an absolute URL, then chop off everything from the first # onward. This works because, in a valid URL, # appears only as the fragment separator.

import "strings"

// clean returns linkURL with the fragment, if any, removed.
func clean(linkURL string) string {
    i := strings.IndexByte(linkURL, '#')
    if i < 0 {
        return linkURL // no fragment present
    }
    return linkURL[:i]
}
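
Since the goal in the question is to avoid treating the same page as several different URLs, here is a minimal sketch of how a string-based strip might feed a "seen" set in a crawler. The names stripFragment and uniquePages are hypothetical, and the sketch assumes the links are otherwise written identically: it does no normalization beyond cutting at the first '#'.

```go
package main

import (
	"fmt"
	"strings"
)

// stripFragment cuts s at the first '#', if there is one.
func stripFragment(s string) string {
	if i := strings.IndexByte(s, '#'); i >= 0 {
		return s[:i]
	}
	return s
}

// uniquePages (hypothetical helper) strips fragments from links and
// keeps the first occurrence of each resulting URL, in order.
func uniquePages(links []string) []string {
	seen := make(map[string]bool)
	var pages []string
	for _, l := range links {
		p := stripFragment(l)
		if p == "" || seen[p] {
			continue // fragment-only link, or page already recorded
		}
		seen[p] = true
		pages = append(pages, p)
	}
	return pages
}

func main() {
	links := []string{
		"en.wikipedia.org/wiki/Race_condition#Computing",
		"en.wikipedia.org/wiki/Race_condition#Example",
		"en.wikipedia.org/wiki/Race_condition",
	}
	fmt.Println(uniquePages(links)) // prints [en.wikipedia.org/wiki/Race_condition]
}
```

Fragment-only links such as "#top" become empty strings here and are skipped; resolving them against the page URL first would instead map them to the page itself.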