
I have a collection of 2,000,000 records

> db.events.count();
2000000

and I use the Go MongoDB driver to connect to the database:

package main

import (
    "context"
    "encoding/json"
    "log"
    "time"

    "github.com/Shopify/sarama"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27888").SetAuth(options.Credential{
        Username: "mongoadmin",
        Password: "secret",
    }))

    if err != nil {
        panic(err)
    }

    defer func() {
        if err = client.Disconnect(ctx); err != nil {
            panic(err)
        }
    }()


    collection := client.Database("test").Collection("events")

    var bs int32 = 10000
    var b = true
    cur, err := collection.Find(context.Background(), bson.D{}, &options.FindOptions{
        BatchSize: &bs, NoCursorTimeout: &b})
    if err != nil {
        log.Fatal(err)
    }
    defer cur.Close(ctx)

    s, n := runningtime("retrive db from mongo and publish to kafka")
    count := 0
    for cur.Next(ctx) {
        var result bson.M
        err := cur.Decode(&result)
        if err != nil {
            log.Fatal(err)
        }

        bytes, err := json.Marshal(result)
        if err != nil {
            log.Fatal(err)
        }
        count++

        msg := &sarama.ProducerMessage{
            Topic: "hello",
            // Key:   sarama.StringEncoder("aKey"),
            Value: sarama.ByteEncoder(bytes),
        }
        asyncProducer.Input() <- msg
    }
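
For reference, runningtime and asyncProducer are helpers defined elsewhere in my code. The producer is created roughly like this (a minimal sketch; the broker address and config values here are assumptions, not my exact setup):

// Minimal sketch of the Kafka producer setup used above.
// Assumed values: a local broker at localhost:9092 and a default sarama config.
config := sarama.NewConfig()
config.Producer.Return.Successes = true
config.Producer.Return.Errors = true

asyncProducer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, config)
if err != nil {
    log.Fatal(err)
}
defer asyncProducer.Close()

// Drain successes and errors so the Input() channel never blocks.
go func() {
    for range asyncProducer.Successes() {
    }
}()
go func() {
    for err := range asyncProducer.Errors() {
        log.Println("produce error:", err)
    }
}()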


But the program only retrieves about 600,000 records instead of 2,000,000 every time I run it.

$ go run main.go
done
count = 605426
nErrors = 0
2020/09/18 11:23:43 End:         retrive db from mongo and publish to kafka took 10.080603336s

I don't know why. I want to retrieve all 2,000,000 records. Thanks for any help.

– Anh Thi
1 Answer


Your loop fetching the results may end early because you are using the same ctx context for iterating over the results, and it has a 10-second timeout.

This means that if retrieving and processing the 2 million records (including connecting) takes more than 10 seconds, the context is cancelled and thus the cursor also reports an error.

Note that setting FindOptions.NoCursorTimeout to true only prevents the cursor from timing out due to inactivity; it does not override the timeout of the context you use.

Use another context for executing the query and iterating over the results, one that does not have a timeout, e.g. context.Background().

Also note that to construct the options for Find, you can use the helper methods, so it may look as simple and as elegant as this:

options.Find().SetBatchSize(10000).SetNoCursorTimeout(true)

So the working code:

ctx2 := context.Background()

cur, err := collection.Find(ctx2, bson.D{},
    options.Find().SetBatchSize(10000).SetNoCursorTimeout(true))

// ...

for cur.Next(ctx2) {
    // ...
}

// Also check error after the loop:
if err := cur.Err(); err != nil {
    log.Printf("Iterating over results failed: %v", err)
}
– icza
  • Thank you so much for helping me out. It retrieves all 2 million records if I use `context.Background()` for the connection, but it doesn't retrieve all of them if I pass `context.Background()` to `collection.Find()`. – Anh Thi Sep 18 '20 at 06:57
  • What matters is to use `context.Background()` for `cur.Next()` as well, as indicated in my answer. – icza Sep 18 '20 at 07:02
  • Oh, I finally understand why. Thank you so much, it works! – Anh Thi Sep 18 '20 at 07:06
  • @AnhThi Also see the edited answer; the options for `find` can be constructed much more easily / elegantly. – icza Sep 18 '20 at 07:20
  • Thank you @icza, it looks elegant. – Anh Thi Sep 18 '20 at 07:24