I'm trying to use Go to extract the text from an HTML page, and I'm using the goquery library to do it. The code is below:

document, err := goquery.NewDocumentFromReader(r)
if err != nil {
    log.Fatalln(err)
}
document.Find("script").Remove()
document.Find("style").Remove()
text := document.Find("body").Text()

The test HTML page: [image]

But the result: [image]

As you can see, the result still contains HTML tags. How can I remove the HTML tags and keep only the text?

Bill

1 Answer

Take the ul element out of the textarea. Inside a textarea, the markup is treated as text itself. [image]

foecum
  • Actually, I'm trying to write a web spider. This HTML page was copied from a real web page and simplified for testing, so I want to find a way to extract the text from pages like this test page. – Bill Sep 24 '16 at 00:08
  • A textarea can only contain text, not HTML elements. You can't display an HTML list inside a textarea; its contents are always treated as a plain string. That is why the markup comes back as part of what .Text() returns. – foecum Sep 24 '16 at 00:15