I need to scrape a website where the tag I'm interested in is:

    <script type="myjson">
        [{"class": "companyname", "location"....and so on
    </script>

Currently I am doing this with goquery, using the following snippet:

    doc.Find("script").Each(func(i int, element *goquery.Selection) {
        _, exists := element.Attr("type")
        if exists {
            filepath := "mypath"

            file, err := os.Create(filepath)
            if err != nil {
                panic("COULD NOT CREATE FILE")
            }
            file.WriteString(element.Text())
            fmt.Println(element.Text())
            file.Close()
        }
    })

The problem with this code is that while element.Text() is printed correctly to stdout (it prints a long slice containing several JSON objects, which I need to save to a file for later work), the file.WriteString call does not write anything to the file. The file remains empty.

It appears that my query is wrong and that it matches 2 elements: the first with zero length, which is the one written to the file, and the second with the real content, which is printed to stdout but not written to the file.

Can you please suggest a correction to my code so that the content is written to the file correctly? I guess there may be an error in my goquery query.

1 Answer

A quick test shows that just calling .Text() should be enough; see the code below:

    package main

    import (
        "fmt"
        "os"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {

        htmlString := `<!DOCTYPE html>
        <html lang="en">
        <head>
            <meta charset="UTF-8">
            <meta http-equiv="X-UA-Compatible" content="IE=edge">
            <meta name="viewport" content="width=device-width, initial-scale=1.0">
            <title>Document</title>
        </head>
        <body>
            <h1>AWESOME HEADER</h1>
            <script type="myjson">
                [{"class": "companyClass", "location": "companyLocation"}]
            </script>

        </body>
        </html>`

        doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlString))
        if err != nil {
            panic(err)
        }

        doc.Find("script").Each(func(i int, element *goquery.Selection) {
            _, exists := element.Attr("type")
            if exists {
                file, err := os.Create("result.txt")
                if err != nil {
                    panic(err)
                }
                defer file.Close()

                stringToWrite := strings.TrimSpace(element.Text())
                fmt.Println(stringToWrite)
                file.WriteString(stringToWrite)
            }
        })

    }

The resulting file as well as stdout contain:

    [{"class": "companyClass", "location": "companyLocation"}]
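
One possible explanation for the empty file, which I cannot confirm without seeing the real page: os.Create truncates an existing file every time it is called, and the loop calls it once for every script element that has any type attribute. If more than one element matches and a later one has empty text, that later call wipes out what an earlier one wrote. Below is a minimal sketch of one way around this, assuming the element texts have already been collected into a slice. The helper names nonEmptyTexts and saveTexts are made up for illustration, and the values in main are stand-ins for real element.Text() results.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// nonEmptyTexts trims each candidate and drops the ones that end up empty,
// so a zero-length match can never become the file's only content.
func nonEmptyTexts(texts []string) []string {
	var kept []string
	for _, t := range texts {
		if trimmed := strings.TrimSpace(t); trimmed != "" {
			kept = append(kept, trimmed)
		}
	}
	return kept
}

// saveTexts creates (or truncates) the file exactly once and writes every
// non-empty text on its own line. It returns the number of texts written.
func saveTexts(path string, texts []string) (int, error) {
	kept := nonEmptyTexts(texts)

	file, err := os.Create(path) // truncates the file if it already exists
	if err != nil {
		return 0, err
	}
	defer file.Close()

	for _, t := range kept {
		if _, err := file.WriteString(t + "\n"); err != nil {
			return 0, err
		}
	}
	return len(kept), nil
}

func main() {
	// Hypothetical stand-ins for the element.Text() values of two matches:
	// one carrying the JSON payload, one empty.
	texts := []string{
		`[{"class": "companyClass", "location": "companyLocation"}]`,
		"",
	}

	n, err := saveTexts("result.txt", texts)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d block(s) to result.txt\n", n)
}
```

With goquery, the slice could be filled inside Each with texts = append(texts, element.Text()), and the selector could be narrowed to script[type="myjson"] so that only the tag of interest matches in the first place.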

Please provide the HTML (or the section of it relevant to the problem) that you are working with.

jabbson