
I have a sharded MongoDB collection with over 1.5 million documents. I use the _id field as the shard key, and its values are integers (rather than ObjectIds).

I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.

My problem is that I somehow have duplicate documents with the same _id. From what I've read, this shouldn't be possible.

I've removed the duplicates, but others still appear.

Do you have any idea where they could come from, or what I should start looking at? (I've also tried to reproduce this on a smaller test collection, but no duplicates are inserted, no matter which write operation I perform.)

klaoo z

2 Answers


This isn't actually a problem with the Perl driver; it is related to the characteristics of sharding. MongoDB can only enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.

In the MongoDB: Configuring Sharding documentation there is specific mention that:

  • When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.

  • You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.

  • If the "unique: true" option is not used, the shard key does not have to be unique.

Stennie
  • I just checked that, and you were right... the "unique:true" option was not specified :) Thanks a lot, your answer was extremely helpful. – klaoo z Jun 28 '12 at 13:28
  • FYI, noticed there is a new tutorial: [MongoDB: Enforce Unique Keys for Sharded Collections](http://docs.mongodb.org/manual/tutorial/enforce-unique-keys-for-sharded-collections/). – Stennie Jun 29 '12 at 00:50

How have you implemented generating the integer ids?

If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:

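// Atomically increment and return the next id for the named sequence.
function counter(name) {
    var ret = db.counters.findAndModify({
         query:{_id:name},
         update:{$inc:{next:1}},  // the increment is applied server-side
         "new":true,              // return the document after the update
         upsert:true});           // create the counter on first use

    return ret.next;
}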

db.users.insert({_id:counter("users"), name:"Sarah C."}) // _id : 1
db.users.insert({_id:counter("users"), name:"Bob D."}) // _id : 2

If you are generating your ids by reading the most recent record in the collection, incrementing the number in the Perl code, and then inserting with the incremented number, you could be running into timing issues: two writers can read the same latest value before either of them inserts.
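To make the timing issue concrete, here is a minimal sketch that uses a plain in-memory object instead of MongoDB. The names `racyAllocate`, `atomicAllocate`, and `lastId` are made up for illustration and are not part of any driver. It forces the worst-case interleaving, where every writer reads before any writer writes, and compares that with an increment done in a single step, which is roughly what `findAndModify` with `$inc` gives you.

```javascript
// Hypothetical in-memory stand-in for a collection; this is NOT MongoDB.

// Read-then-increment: every writer reads the latest id first,
// and only afterwards does each one write back "latest + 1".
function racyAllocate(store, writers) {
  const reads = [];
  for (let i = 0; i < writers; i++) {
    reads.push(store.lastId); // all writers see the same value
  }
  return reads.map((latest) => {
    store.lastId = latest + 1;
    return latest + 1;
  });
}

// Atomic increment: reading and incrementing happen as one step,
// so no two writers can observe the same value.
function atomicAllocate(store, writers) {
  const ids = [];
  for (let i = 0; i < writers; i++) {
    store.lastId += 1;
    ids.push(store.lastId);
  }
  return ids;
}

console.log(racyAllocate({ lastId: 0 }, 3));   // [ 1, 1, 1 ]
console.log(atomicAllocate({ lastId: 0 }, 3)); // [ 1, 2, 3 ]
```

With the racy version all three writers end up with the same id, which is exactly the kind of collision that only shows up under concurrent load and is hard to reproduce on a small test collection.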

Mark Withers
  • Here are some more details about the process: I receive daily data, which I process and insert/update in the collection. For the _id field, I am using the same id I get from the external source (for performance purposes). – klaoo z Jun 28 '12 at 11:26