I have written a colly script to collect port authority information from a site.

func main() {
    // Temp Variables
    var tcountry, tport string

    // Colly collector
    c := colly.NewCollector()

    // Ignore robots.txt
    c.IgnoreRobotsTxt = true
    // Time out after 20 seconds
    c.SetRequestTimeout(20 * time.Second)
    // Use random user agents for requests
    extensions.RandomUserAgent(c)

    // Set limits on colly's operation
    c.Limit(&colly.LimitRule{
        // Filter domains affected by this rule
        DomainGlob: "searates.com/*",
        // Set a delay between requests to these domains
        Delay: 1 * time.Second,
        // Add an additional random delay
        RandomDelay: 3 * time.Second,
    })

    // Find and visit all country links
    c.OnHTML("#clist", func(e *colly.HTMLElement) {
        // fmt.Println("Country List: ", h.ChildAttrs("a", "href"))
        e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
            tcountry = el.ChildText("a")
            link := el.ChildAttr("a", "href")
            fmt.Println("Country: ", tcountry, link)
            e.Request.Visit(link)
        })

    })

    // Find and visit all ports links
    c.OnHTML("#plist", func(h *colly.HTMLElement) {
        // fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
        h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
            tport = el.ChildText("a")
            link := el.ChildAttr("a", "href")
            fmt.Println("Port: ", tport, link)
            h.Request.Visit(link)
        })
    })

    // Find and visit all ports info page
    c.OnHTML("div.row", func(e *colly.HTMLElement) {
        portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
        fmt.Println("Port Authority: ", portAuth)
    })

    c.Visit("https://www.searates.com/maritime/")
}

I have two questions below:

  1. I am more or less forced to use e.Request.Visit, because d.Visit (where d is a clone of c) never seems to run. When I cloned c as d and used d to scrape the 'port info' part, the whole block was skipped. What am I doing wrong, and why does this happen?

  2. In the code as it stands, fmt.Println("Port Authority: ", portAuth) is executed twice for each port. I get the output below:

❯ go run .
Country:  Albania /maritime/albania
Port:  Durres /port/durres_al
Port Authority:  Durres Port Authority
Port Authority:  
Port:  Sarande /port/sarande_al
Port Authority:  Sarande Port Authority
Port Authority:  
Port:  Shengjin /port/shengjin_al
Port Authority:  Shengjin Port Authority
Port Authority:  

Again, I don't understand why it is printed twice. Kindly help :)

CaptV89

1 Answer

From the colly package documentation:

collector.Visit - Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks

Request.Visit - Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks.

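The difference, then, is the depth and the context. collector.Visit always starts at depth 1 with no inherited context, even when it is called inside an event handler. Request.Visit uses the parent request's depth plus one, carries its Ctx forward, and resolves the URL against the current request before scraping.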

Here are the invocation differences:

collector.Visit:

if c.CheckHead {
    if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
        return check
    }
}
return c.scrape(URL, "GET", 1, nil, nil, nil, true)

Request.Visit:

return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
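To make the depth difference concrete, here is a toy model of the two calls. It is only a sketch: the `request` type and the `collectorVisit` and `requestVisit` functions are hypothetical stand-ins written for this illustration, not part of colly. They mirror just the depth argument visible in the two snippets above.

```go
package main

import "fmt"

// request is a hypothetical stand-in for colly.Request that tracks only depth.
type request struct {
	url   string
	depth int
}

// collectorVisit mimics collector.Visit: every request starts at depth 1.
func collectorVisit(url string) request {
	return request{url: url, depth: 1}
}

// requestVisit mimics Request.Visit: the child is one level deeper than its parent.
func requestVisit(parent request, url string) request {
	return request{url: url, depth: parent.depth + 1}
}

func main() {
	root := collectorVisit("https://www.searates.com/maritime/")
	country := requestVisit(root, "https://www.searates.com/maritime/albania")

	// Same port page, reached in two different ways.
	viaRequest := requestVisit(country, "https://www.searates.com/port/durres_al")
	viaCollector := collectorVisit("https://www.searates.com/port/durres_al")

	fmt.Println(viaRequest.depth)   // 3
	fmt.Println(viaCollector.depth) // 1
}
```

This would matter mostly when a depth limit is configured on the collector, since a request started with collector.Visit never counts as deeper than the first level.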

To address your questions specifically: to invoke the cloned collector d, call d.Visit from within a c.OnHTML event handler, and register the port-info callback on d rather than on c. See the coursera example. You also need to pass an absolute URL, because d.Visit, unlike Request.Visit, does not resolve a relative link against the current page. Here it is all put together:

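package main

// Import paths below assume colly v1; for v2 use github.com/gocolly/colly/v2.
import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
)

func main() {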
    // Temp Variables
    var tcountry, tport string

    // Colly collector
    c := colly.NewCollector()

    // Ignore robots.txt
    c.IgnoreRobotsTxt = true
    // Time out after 20 seconds
    c.SetRequestTimeout(20 * time.Second)
    // Use random user agents for requests
    extensions.RandomUserAgent(c)

    // Set limits on colly's operation
    c.Limit(&colly.LimitRule{
        // Filter domains affected by this rule. The glob is matched against
        // the request host (e.g. www.searates.com), so it must not include
        // a path; the original "searates.com/*" would not match.
        DomainGlob: "*searates.com",
        // Set a delay between requests to these domains
        Delay: 1 * time.Second,
        // Add an additional random delay
        RandomDelay: 3 * time.Second,
    })

    d := c.Clone()

    // Find and visit all country links
    c.OnHTML("#clist", func(e *colly.HTMLElement) {
        // fmt.Println("Country List: ", h.ChildAttrs("a", "href"))
        e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
            tcountry = el.ChildText("a")
            link := el.ChildAttr("a", "href")
            fmt.Println("Country: ", tcountry, link)
            e.Request.Visit(link)
        })

    })

    // Find and visit all ports links
    c.OnHTML("#plist", func(h *colly.HTMLElement) {
        // fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
        h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
            tport = el.ChildText("a")
            link := el.ChildAttr("a", "href")
            fmt.Println("Port: ", tport, link)

            absoluteURL := h.Request.AbsoluteURL(link)
            d.Visit(absoluteURL)
        })
    })

    // Find and visit all ports info page
    d.OnHTML("div.row", func(e *colly.HTMLElement) {
        portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
        if len(portAuth) > 0 {
            fmt.Println("Port Authority: ", portAuth)
        }
    })

    c.Visit("https://www.searates.com/maritime/")
}

Notice that the absolute URL is used: d.Visit passes the URL to the scraper as-is, without resolving it against the page it was found on, so the cloned collector cannot follow a relative link such as /port/durres_al.
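For intuition about what resolving a relative link involves, the sketch below does it with the standard library only. The `resolve` helper is a hypothetical name used for this illustration and is not colly's implementation. The URLs are taken from the output in the question.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns a possibly relative link into an absolute URL,
// using the page on which the link was found as the base.
func resolve(pageURL, link string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolve("https://www.searates.com/maritime/albania", "/port/durres_al")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs) // https://www.searates.com/port/durres_al
}
```

A link that is already absolute passes through unchanged, so running every link through a step like this before calling d.Visit should be harmless.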

Regarding the second question: it is printed twice because there are two div.row elements on the page, so the handler runs once for each. Only one of them contains the port table; the other yields an empty string. I tried several CSS selectors to restrict the event handler to the first div.row, but it is simpler to check that the string is non-empty.
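The effect can be reproduced without colly. In the sketch below, the `matches` slice is a hypothetical stand-in for the text the port-authority selector returns inside each of the two div.row elements, and `firstNonEmpty` is a helper invented for this illustration. It shows the same idea as the length check in the handler above.

```go
package main

import (
	"fmt"
	"strings"
)

// firstNonEmpty returns the first value that still contains text
// after surrounding whitespace is trimmed.
func firstNonEmpty(values []string) (string, bool) {
	for _, v := range values {
		if t := strings.TrimSpace(v); t != "" {
			return t, true
		}
	}
	return "", false
}

func main() {
	// One entry per div.row: only the first contains the port table.
	matches := []string{"Durres Port Authority", ""}

	if auth, ok := firstNonEmpty(matches); ok {
		fmt.Println("Port Authority: ", auth)
	}
}
```

Trimming before the check also guards against a match that contains only whitespace, which a plain length check would let through.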

jth_92