
I'm working on a Go web crawler that parses the search results of a specific search engine. The main difficulty is parsing with concurrency, or rather, handling pagination such as ← Previous 1 2 3 4 5 ... 34 Next →. Everything works fine except the recursive crawling of paginated results. Here is my code:

package main

import (
    "bufio"
    "errors"
    "fmt"
    "net"
    "strings"

    "github.com/antchfx/htmlquery"
    "golang.org/x/net/html"
)

type Spider struct {
    HandledUrls []string
}

func NewSpider(url string) *Spider {
    // ...
}

func requestProvider(request string) string {
    // Everything is good here
}

func connectProvider(url string) net.Conn {
    // Also works
}

// getContents makes a request to the search engine and returns the response body
func getContents(request string) *html.Node {
    // ...
}

// checkResult checks for empty search results
func checkResult(node *html.Node) bool {
    // ...
}

func (s *Spider) checkVisited(url string) bool {
    // ...
}

// Here is the problem
func (s *Spider) Crawl(url string, channelDone chan bool, channelBody chan *html.Node) {
    body := getContents(url)

    defer func() {
        channelDone <- true
    }()

    if checkResult(body) == false {
        err := errors.New("Nothing found there")
        ErrFatal(err)
    }

    channelBody <- body
    s.HandledUrls = append(s.HandledUrls, url)
    fmt.Println("Handled ", url)

    newUrls := s.getPagination(body)

    for _, u := range newUrls {
        fmt.Println(u)
    }

    for i, newurl := range newUrls {
        if s.checkVisited(newurl) == false {
            fmt.Println(i)
            go s.Crawl(newurl, channelDone, channelBody)
        }
    }
}

func (s *Spider) getPagination(node *html.Node) []string {
    // ...
}

func main() {
    request := requestProvider(*requestFlag)
    channelBody := make(chan *html.Node, 120)
    channelDone := make(chan bool)

    var parsedHosts []*Host

    s := NewSpider(request)

    go s.Crawl(request, channelDone, channelBody)

    for {
        select {
        case receivedNode := <-channelBody:
             // ...

            for _, h := range newHosts {
                 parsedHosts = append(parsedHosts, h)
                 fmt.Println("added", h.HostUrl)
            }

        case <-channelDone:
            fmt.Println("Jobs finished")
        }

        break
    }
}
}

It always returns only the first page; the pagination is never followed. The same getPagination(...) works fine on its own. Please tell me where my error(s) are. I hope Google Translate was correct.

himei
1 Answer


The problem is most likely that main exits before all the goroutines have finished.

First, there is a break after the select statement, and it runs unconditionally the first time a channel is read. That ensures main returns after the first value is sent over channelBody.

Secondly, using channelDone is not the right way here. The most idiomatic approach would be a sync.WaitGroup: before starting each goroutine, call wg.Add(1), replace the defer with defer wg.Done(), and in main call wg.Wait(). Be aware that you must pass a pointer to the WaitGroup, not a copy.
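
Here is a minimal sketch of that pattern, under simplified assumptions: crawl, results, and depth are stand-ins for your Crawl, channelBody, and pagination logic, and the fake pagination just generates two child URLs per page.

package main

import (
    "fmt"
    "sync"
)

// crawl is a stand-in for Spider.Crawl. The WaitGroup is passed as a
// pointer so every goroutine shares the same counter.
func crawl(url string, depth int, wg *sync.WaitGroup, results chan<- string) {
    defer wg.Done()

    results <- url // stands in for channelBody <- body

    if depth == 0 {
        return
    }
    // Stand-in for getPagination: each page "links" to two more pages.
    for i := 1; i <= 2; i++ {
        next := fmt.Sprintf("%s/%d", url, i)
        wg.Add(1) // Add before starting the goroutine
        go crawl(next, depth-1, wg, results)
    }
}

func main() {
    var wg sync.WaitGroup
    results := make(chan string, 16)

    wg.Add(1)
    go crawl("https://example.com/search", 2, &wg, results)

    // Close the results channel once every crawl goroutine is done,
    // so the range loop below terminates on its own.
    go func() {
        wg.Wait()
        close(results)
    }()

    for url := range results {
        fmt.Println("handled", url)
    }
    fmt.Println("jobs finished")
}

Closing the results channel from a separate goroutine after wg.Wait() lets the consumer be a plain range loop that ends by itself, which replaces both the channelDone channel and the select/break construct.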

leaf bebop