Scraper with Colly in headless mode?

Hello,

I am new to Go and I have to build a scraper for a school project in France.

The site I have to scrape is www.allrecipes.com. On this site, I chose this page: https://www.allrecipes.com/recipes/17562/dinner/

From this page, I need to collect some recipes, specifically: title, URL, ingredients, steps, and description.

I saw that www.allrecipes.com was built with vue.js, and when I try to get the URLs, I can't.

In the code, I use Colly. Can Colly be used in a headless mode, the way chromedp is?

package main

import (
    "encoding/json"
    "fmt"
    "os"

    "github.com/gocolly/colly"
)

type products struct {
    Name string `json:"name"`
    URL  string `json:"url"`
}

var allProducts []products

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.allrecipes.com"),
    )

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Scraping:", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Status:", r.StatusCode)
    })

    // OnHTML registers a callback that runs for every HTML element matching the selector.
    c.OnHTML("a.mntl-card", func(h *colly.HTMLElement) {
        products := products{
            URL:  h.ChildAttr("a.mntl-card-list-items", "href"),
            Name: h.ChildText(".card__title-text"),
        }
        fmt.Println(products)
        allProducts = append(allProducts, products)
    })

    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    c.Visit("https://www.allrecipes.com/recipes/17562/dinner/")

    content, err := json.Marshal(allProducts)
    if err != nil {
        fmt.Println(err.Error())
    }
    os.WriteFile("data.json", content, 0644)
    fmt.Println("Total products:", len(allProducts))
}

maka

1 Answer

It seems that this is the only change needed to make it work:

 products := products{
-   URL:  h.ChildAttr("a.mntl-card-list-items", "href"),
+   URL:  h.Attr("href"),
    Name: h.ChildText(".card__title-text"),
 }

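Note that on that page a.mntl-card-list-items and a.mntl-card match the same element. ChildAttr only searches the descendants of the matched element, so it finds nothing there; h.Attr reads the href from the matched element itself.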

Notes:

  1. Colly does not involve a browser, so it has nothing to do with "headless" mode.
  2. The page does not appear to use vue.js, and the HTML response already contains everything you need. In this case, Colly is a perfect fit.
  3. chromedp drives a real browser and is heavy compared to Colly. You don't need it when Colly can do the job.
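
A quick way to check note 2 for any page is to download the raw HTML without running JavaScript (which is all Colly sees) and look for the markup you intend to select. The sketch below uses only the standard library. The helper names fetchBody and servedStatically are made up for illustration, and the local httptest server with its one-line HTML is a stand-in assumed for the real page so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fetchBody downloads url with a plain HTTP GET. No JavaScript runs,
// so the result is the same kind of document a Colly collector parses.
// (Hypothetical helper, not part of Colly.)
func fetchBody(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// servedStatically reports whether marker (for example a class name used
// in a selector) already appears in the raw HTML. If it does, a headless
// browser is probably unnecessary. (Hypothetical helper.)
func servedStatically(body, marker string) bool {
	return strings.Contains(body, marker)
}

func main() {
	// Stand-in for the real page: an assumed, simplified card element.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a class="mntl-card-list-items mntl-card" href="/recipe/1/">`+
			`<span class="card__title-text">Soup</span></a>`)
	}))
	defer srv.Close()

	body, err := fetchBody(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("card title in raw HTML:", servedStatically(body, "card__title-text"))
}
```

To try it against the real site, replace srv.URL with the page URL. A substring check is only a rough signal, but if the class names show up in the raw response, the selector is the thing to debug, not the rendering.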
Zeke Lu