I am scraping a site with colly and want to run the scraper periodically using cron. Here is my basic attempt:

type scraper struct {
    coll *colly.Collector
    rc   *redis.Client
}

func newScraper(c *colly.Collector, rc *redis.Client) scraper {
    return scraper{coll: c, rc: rc}
}

func main() {
    rc := redis.NewClient(&redis.Options{
        Addr:     "localhost:3000",
        Password: "", // no password set
        DB:       0,  // use default DB
    })

    coll := colly.NewCollector()

    scrape := newScraper(coll, rc)

    c := cron.New()
    c.AddFunc("@every 10s", scrape.scrapePls)
    c.Start()

    sig := make(chan int)
    <-sig
}

func (sc scraper) scrapePls() {
    sc.coll.OnHTML(`body`, func(e *colly.HTMLElement) {
        //Extracting required content

        //Using Redis to store data
    })

    sc.coll.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL)
    })

    sc.coll.Visit("www.example.com")
}

It doesn't seem to work: it makes one call and then never makes the next one. I'm not sure whether I'm missing something. Are there any other approaches I could take?

Any help would be appreciated.

Thanks!

1 Answer

c.AddFunc returns an error that you are not checking. Please check it, in case it reveals more information.

You should also be able to inspect the return value of c.Entries(), which should tell you the next time your function will be called.

In case you were not aware, you don't need a full library to run a function periodically. For example, you can do:

scrape := newScraper(coll, rc)

sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt)
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()

// Run the function once up front so we don't have to wait 10 seconds for the first run (optional).
scrape.scrapePls()
for {
    select {
    case <-ticker.C:
        // The ticker sends a message every 10 seconds.
        scrape.scrapePls()

        // You can also start a goroutine each time instead. If scrapePls takes longer than the
        // interval to run, this may cause issues due to an ever-increasing number of goroutines.
        // go scrape.scrapePls()

    case <-sig:
        return
    }
}
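
The loop above is only a fragment. A self-contained version might look roughly like the sketch below. The names `runEvery` and `job` are hypothetical (they are not from the question), `job` stands in for `scrape.scrapePls`, and the one-second timeout is there only so the demo exits on its own.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"time"
)

// runEvery (hypothetical helper) calls fn once immediately and then once per
// interval until the stop channel is closed. Calls never overlap: a slow fn
// delays the next call, and the ticker drops ticks it cannot deliver.
func runEvery(stop <-chan struct{}, interval time.Duration, fn func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	fn() // first run without waiting a full interval
	for {
		// Check stop first so a pending tick cannot win over a shutdown request.
		select {
		case <-stop:
			return
		default:
		}

		select {
		case <-stop:
			return
		case <-ticker.C:
			fn()
		}
	}
}

func main() {
	// Stop on Ctrl-C, or after one second so this demo terminates by itself.
	ctx, stopSignals := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stopSignals()
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()

	// Stand-in for scrape.scrapePls from the question (assumption).
	job := func() { log.Println("scraping...") }

	runEvery(ctx.Done(), 200*time.Millisecond, job)
}
```

In a real program you would drop the timeout and pass `10*time.Second` as the interval, so the loop runs until the process is interrupted.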
Dylan Reimerink
  • Thanks for the solution that uses the ticker to call it periodically. I added c.Entries and got this: `{1 {30s} 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC 0x6efa80 0x6efa80}]`. It wasn't helpful to me. Does this help? – Adith Dev Reddy Nov 14 '21 at 12:43
  • It still stops after the first call. – Adith Dev Reddy Nov 14 '21 at 12:43
  • What `c.Entries` shows is that the function is scheduled, but for every 30 seconds rather than every 10. The times are still uninitialized; they will be set after the first execution. As for "it still stops after the first call": do you mean with the ticker? If so, it means that you never return from `scrapePls`. I recommend you set up [delve](https://github.com/go-delve/delve) and step through your program so you can see where things go wrong – Dylan Reimerink Nov 14 '21 at 14:15
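
A lighter-weight check than a debugger, for the point in the last comment about whether `scrapePls` ever returns, is to wrap the job so that each run logs when it starts, when it returns, and any error. This is only a sketch: `traced` is a hypothetical helper, and it assumes `scrapePls` is changed to return the error from `Visit`, which the question's code currently discards.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// traced (hypothetical helper) wraps fn so every run logs its start, its
// duration, and any error it returns. A "started" line with no matching
// "returned" line means fn is still blocked.
func traced(name string, fn func() error) func() {
	return func() {
		start := time.Now()
		log.Printf("%s: started", name)
		err := fn()
		log.Printf("%s: returned after %s, err=%v", name, time.Since(start), err)
	}
}

func main() {
	// Stand-in for a scrapePls that returns the error from Visit (assumption).
	job := traced("scrapePls", func() error {
		time.Sleep(20 * time.Millisecond)
		return errors.New("example error")
	})
	job()
	job()
}
```

The wrapper returns a plain `func()`, so it can be passed to `c.AddFunc` or called from the ticker loop in place of the original function.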