
What change do I make in the code below to index into Elasticsearch using go-colly?

  1. I want to get full text (strip html, strip js, render if needed), then

  2. Conform it to an avro schema {pageurl: , title:, content:},

  3. Bulk-post to a specific Elasticsearch index 'mywebsiteindex-yyyymmdd' - perhaps using a config file rather than hardcoding.

Code snippets would be great. Is there example go-colly code that shows "pipelining" the output of crawl -> scrape -> yield to elastic (e.g., as in the python scrapy framework)? I.e., pipelining framework support.

For inserting to elastic, I'm considering: https://github.com/olivere/elastic ?

package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("www.coursera.org"),
		colly.Async(true),
	)

	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		e.Request.Visit(link)
	})

	pageCount := 0
	c.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
	})

	// Set error handler
	c.OnError(func(r *colly.Response, err error) {
		log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
	})

	// Print the response
	c.OnResponse(func(r *colly.Response) {
		pageCount++ // NOTE: not goroutine-safe with Parallelism > 1; use sync/atomic in real code
		urlVisited := r.Ctx.Get("url")
		log.Printf("%d  DONE Visiting : %s", pageCount, urlVisited)
	})

	baseUrl := "https://www.coursera.org"
	c.Visit(baseUrl)
	// With colly.Async(true), wait for the crawl to finish before exiting
	c.Wait()
}
Espresso


You are correct that you will need an additional library to store data into elastic. go-colly only does the scraping part of the job. Depending on your scraping strategy, you will need to write code that stores the results of scraping into indices.

Generally, you want to use a library like olivere/elastic, connect to elastic, and initialize the index. Then you likely want a function that stores structured data into that index, and you call it from the appropriate go-colly callback (e.g. c.OnHTML()) once you have all the data you want to store (what that is, is not really clear from the code snippet provided). To read more on how to use olivere/elastic, see its godoc (note that version 7 has breaking API changes, so some tutorials for older versions might not work).

There are many decisions to make along the way depending on your particular use case (e.g. how data will be structured in indices, when data should be sent to elastic - meaning which go-colly callback to use for that, how you want to refresh pages that are already in the index, etc.).

As for frameworks, I am not aware of anything that provides an end-to-end pipeline from scraping to storing in elastic.

blami
  • It still depends on many factors and there are many additional questions (the question seems very open). Do you want to store to elastic when you get a full response, or when you e.g. hit a new page? Do you want to refresh existing documents in the index? etc. I provided a general answer because there are many unknowns. – blami May 07 '20 at 06:23
  • thanks, I've looked at dataflowkit. I find go-colly to be lighter and therefore easier to understand. Perhaps if I just add the elastic part, I'll use that to dive deeper into golang. – Espresso May 07 '20 at 06:26
  • Thanks for good feedback. I updated my answer with design of very general pipeline that would combine both. – blami May 07 '20 at 06:40