I am making a web crawler. I'm passing the URL to a crawler function and parsing the response to get all the links in the anchor tags, then I'm invoking the same crawler function for all those URLs, using a separate goroutine for every URL.
But if I send a request and cancel it before I get the response, all the goroutines for that particular request are still running.
Now what I want is that when I cancel the request, all the goroutines that were invoked because of that request stop.
Please guide.
Following is my code for the crawler function.

func crawler(c echo.Context, urlRec string, feed chan string, urlList *[]string, wg *sync.WaitGroup) {
    defer wg.Done()
    URL, err := url.Parse(urlRec)
    if err != nil {
        log.Print(err)
        return
    }
    response, err := http.Get(urlRec)
    if err != nil {
        log.Print(err)
        return
    }

    body := response.Body
    defer body.Close()

    tokenizer := html.NewTokenizer(body)
    flag := true
    for flag {
        tokenType := tokenizer.Next()
        switch {
        case tokenType == html.ErrorToken:
            flag = false
        case tokenType == html.StartTagToken:
            token := tokenizer.Token()

            // Check if the token is an <a> tag
            isAnchor := token.Data == "a"
            if !isAnchor {
                continue
            }

            ok, urlHref := getReference(token)
            if !ok {
                continue
            }

            // Make sure the URL begins with http
            hasProto := strings.HasPrefix(urlHref, "http")
            if hasProto {
                if !urlInURLList(urlHref, urlList) {
                    if strings.Contains(urlHref, URL.Host) {
                        *urlList = append(*urlList, urlHref)
                        // fmt.Println(urlHref)
                        // c.String(http.StatusOK, urlHref+"\n")
                        if !checkExt(filepath.Ext(urlHref)) {
                            wg.Add(1)
                            go crawler(c, urlHref, feed, urlList, wg)
                        }
                    }
                }
            }
        }
    }
}

And following is my POST request handler

func scrapePOST(c echo.Context) error {
    var urlList []string
    urlSession := urlFound{}
    var wg sync.WaitGroup
    urlParam := c.FormValue("url")
    feed := make(chan string, 1000)
    wg.Add(1)
    go crawler(c, urlParam, feed, &urlList, &wg)
    wg.Wait()
    count := 0
    for _, url := range urlList {
        switch filepath.Ext(url) {
        case ".jpg", ".jpeg", ".png":
            urlSession.Images = append(urlSession.Images, url)
        case ".doc", ".docx", ".pdf", ".ppt":
            urlSession.Documents = append(urlSession.Documents, url)
        default:
            urlSession.Links = append(urlSession.Links, url)
        }
        count++
    }
    urlSession.Count = count
    // jsonResp, _ := json.Marshal(urlSession)
    // fmt.Print(urlSession)
    return c.JSON(http.StatusOK, urlSession)
}
heartofrevel
    This is probably what you are looking for. I'll see if I can find a good example of its usage. https://godoc.org/context#example-WithCancel EDIT: oops, linked to the old `net/context` package. It's fixed now. It has a mini example – RayfenWindspear Aug 05 '17 at 18:48
  • I'm using the echo framework for the server; I don't think echo has any function like this. – heartofrevel Aug 05 '17 at 18:54
  • OK then, you can roll your own kill channel. In this case I'll make a few assumptions in my answer, but I think it will work just fine. – RayfenWindspear Aug 05 '17 at 19:06
  • BTW, you realize that you are concurrently writing to a `slice`, which is VERY BAD. It's a separate issue which I am selectively ignoring in my answer because it's a separate issue... – RayfenWindspear Aug 05 '17 at 19:29
  • @RayfenWindspear Any alternative? Should I create a slice with a large capacity and add by index? – heartofrevel Aug 05 '17 at 21:06
  • You have no idea how large to make it. Instead of passing the `slice`, make a `channel` to pass to the crawler to put values into. Then immediately before your first crawl, set up a single goroutine that reads from the `channel` into the `slice`. Then after `wg.Wait` close the channel and it is safe to use the `slice`. Be sure to close this channel too if the request is cancelled. – RayfenWindspear Aug 05 '17 at 21:14
  • @RayfenWindspear But then again, I will be getting the values from the channel and appending them to the slice. And I need to check if the current URL is in the URL list before adding it. Also, I cannot maintain a global variable, as it would then be the same for every request made. – heartofrevel Aug 05 '17 at 23:22
  • Given that you need to check if the URL is in the slice, and then write to it, a channel isn't what you need. You will need a mutex to lock the slice while it is being read/written (see the sketch below this thread). You can either do this yourself, or find a lib with a concurrency-safe slice. I highly suggest a library so you don't have to worry about not getting it quite right. It is easy to miss a single spot and have things go haywire. – RayfenWindspear Aug 06 '17 at 18:44
  • @RayfenWindspear No, I wanted to know if there is any alternative to appending to a slice. – heartofrevel Aug 06 '17 at 21:07
  • If you know how many items there are, yes you could allocate then use index only. Pre-allocation then writing only by index is safe. However, yours is a recursive solution, and also by the problem's very nature, you won't know how many links you will pull. So you are basically stuck using concurrency safeties. – RayfenWindspear Aug 06 '17 at 21:17
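
A minimal sketch of the concurrency-safe list discussed in this thread: a mutex-guarded slice plus a set for the duplicate check. Note that urlStore and its methods are made-up names for illustration, not from any library (it only needs "sync" imported).

// urlStore is a mutex-guarded URL list with a set for fast
// membership checks.
type urlStore struct {
    mu   sync.Mutex
    urls []string
    seen map[string]bool
}

func newURLStore() *urlStore {
    return &urlStore{seen: make(map[string]bool)}
}

// addIfNew appends u only if it has not been seen before and
// reports whether it was added. Safe for concurrent use.
func (s *urlStore) addIfNew(u string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.seen[u] {
        return false
    }
    s.seen[u] = true
    s.urls = append(s.urls, u)
    return true
}

The map also makes the duplicate check O(1) instead of the linear scan urlInURLList does on every link.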

1 Answer


The echo context exposes the HTTP request, which has a context tied to the server request already. Just get that context, and check it for cancellation, and/or pass it along to methods that take a context.

ctx := c.Request().Context()
select {
case <-ctx.Done():
    return ctx.Err()
default:
    // Continue handling the request
}

// and pass along to the db or whatever else:
rows, err := db.QueryContext(ctx, ...)

If the client aborts the connection, the Request-scoped context will automatically be cancelled.
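
Applied to the crawler in the question, that would mean accepting a context instead of the echo.Context, checking it before doing any work, and attaching it to the outgoing fetch. A rough sketch, not a drop-in replacement (the tokenizing loop is unchanged and elided):

func crawler(ctx context.Context, urlRec string, feed chan string, urlList *[]string, wg *sync.WaitGroup) {
    defer wg.Done()

    // Bail out early if the client has already cancelled the request.
    select {
    case <-ctx.Done():
        return
    default:
    }

    // Attach the context to the outgoing request so the fetch itself
    // is aborted when the context is cancelled.
    req, err := http.NewRequest("GET", urlRec, nil)
    if err != nil {
        log.Print(err)
        return
    }
    response, err := http.DefaultClient.Do(req.WithContext(ctx))
    if err != nil {
        log.Print(err)
        return
    }
    defer response.Body.Close()

    // ... tokenize as before, and pass ctx along when recursing:
    // go crawler(ctx, urlHref, feed, urlList, wg)
}

In the handler, the first call then becomes go crawler(c.Request().Context(), urlParam, feed, &urlList, &wg).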

If you want to add your own cancellation conditions (timeouts, or whatever), you can do that, too:

req := c.Request()
ctx, cancel := context.WithCancel(req.Context())
req = req.WithContext(ctx)
defer cancel()
// do stuff, which may conditionally call cancel() to cancel the context early
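
For example, a timeout on top of the client-cancellation behaviour could look like this (the 30-second duration is just an illustration):

ctx, cancel := context.WithTimeout(c.Request().Context(), 30*time.Second)
defer cancel()
// ctx is cancelled when the client disconnects or after 30 seconds,
// whichever happens first; pass it to the crawler as above.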
Jonathan Hall
  • Ohh, that's what you mean... I totally spaced that. My original comment had the answer, I just forgot a real legit `context` was in the request XD. – RayfenWindspear Aug 05 '17 at 20:02
  • Correct answer, but I would like to add something in case anyone searching like me finds that the context still doesn't get cancelled when the request is cancelled from the client side: please make sure that the request body is empty or closed. ref: https://stackoverflow.com/questions/57246852/go-http-context-not-able-to-capture-cancellation-signal-when-request-has-body-c – Haytham.Breaka Jan 29 '20 at 11:29