
I have a collection of approximately 500 million documents in which the uniqueness constraint appears not to have been enforced on a specific subset of documents. The uniqueness constraint applies to a compound index.

The indices on this collection:

db.elements.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "test.elements"
        },
        {
                "v" : 1,
                "key" : {
                        "sessionId" : 1
                },
                "name" : "sessionId_1",
                "ns" : "test.elements"
        },
        {
                "v" : 1,
                "key" : {
                        "sessionId" : 1,
                        "modelFolder" : 1,
                        "modelName" : 1
                },
                "name" : "sessionId_1_modelFolder_1_modelName_1",
                "ns" : "test.elements",
                "options" : {
                        "unique" : true
                }
        },
        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "name" : "id_1",
                "ns" : "test.elements",
                "options" : {
                        "unique" : false
                }
        },
        {
                "v" : 1,
                "key" : {
                        "uniqueId" : 1
                },
                "name" : "uniqueId_1",
                "ns" : "test.elements"
        }
]
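
For reference, a unique compound index like 'sessionId_1_modelFolder_1_modelName_1' above is normally declared along these lines (a sketch only; the actual creation call isn't shown here, and older shells use ensureIndex instead of createIndex):

// Sketch: declaring the unique compound index in the mongo shell.
db.elements.createIndex(
    { "sessionId" : 1, "modelFolder" : 1, "modelName" : 1 },
    { "unique" : true }
);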

When I run the following query I get duplicates, despite the query matching on the fields of the unique compound index 'sessionId_1_modelFolder_1_modelName_1' (specific field values redacted due to IP concerns):

var gs = (
    db
    .elements
    .aggregate(
        [
          {
            $match : {
              "sessionId" : (specific sessionId value),
              "modelName" : (specific modelName value),
              "modelFolder" : (specific modelFolder value)
            }
          },
          {
            $group : {
              _id : "$id",
              total : { $sum : 1 }
            }
          }
        ]  
      )
  );

gs.forEach(
    function(g) { printjson(g); }
  );

A subset of the output:

{ "_id" : 1394912, "total" : 2 }
{ "_id" : 1394916, "total" : 2 }
{ "_id" : 1394914, "total" : 2 }
{ "_id" : 1394909, "total" : 2 }
{ "_id" : 1394877, "total" : 2 }
{ "_id" : 1394908, "total" : 2 }
{ "_id" : 1394900, "total" : 2 }
{ "_id" : 1394906, "total" : 2 }
{ "_id" : 1394907, "total" : 2 }
{ "_id" : 1394876, "total" : 2 }
{ "_id" : 1394904, "total" : 2 }
{ "_id" : 1394902, "total" : 2 }
{ "_id" : 1394903, "total" : 2 }
{ "_id" : 1394881, "total" : 2 }
{ "_id" : 1394859, "total" : 2 }
{ "_id" : 1394901, "total" : 2 }
{ "_id" : 1394878, "total" : 2 }
{ "_id" : 1394880, "total" : 2 }
{ "_id" : 1394857, "total" : 2 }
{ "_id" : 1394875, "total" : 2 }

I had killed a batch insert of this subset of documents and then re-ran the bulk insert later on, but I'm surprised this could somehow allow duplicates. Am I going crazy, or is this possible under certain conditions?
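
A more direct check for genuine duplicates of the unique key would be to group on all three indexed fields and keep only groups with more than one document; a sketch (the sessionId placeholder stands in for the redacted value, and allowDiskUse is just a precaution for large groupings):

// Sketch: list duplicates of the full compound key within one session.
var sid = (specific sessionId value);
db.elements.aggregate(
    [
      { $match : { "sessionId" : sid } },
      { $group : {
          _id : {
            sessionId   : "$sessionId",
            modelFolder : "$modelFolder",
            modelName   : "$modelName"
          },
          total  : { $sum : 1 },
          docIds : { $push : "$_id" }
      } },
      { $match : { total : { $gt : 1 } } }
    ],
    { allowDiskUse : true }
  ).forEach(function(g) { printjson(g); });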

Dr DR
  • Can you create a test subset? And are the duplicates in your collection? I see an aggregation with a group operator; I assume there is an unwind before it. Isn't that unwind creating the duplicate entries? – HoefMeistert Nov 01 '16 at 07:52
  • There is no unwind (that I'm aware of). Actually I've realized the grouping is redundant in this example; it just shows that there are ANY duplicates at all, let alone duplicates among the grouped ids. I'll try to recreate this scenario on a test subset, but I fear this has come about due to some undefined state caused by the cancellation of a bulk insert operation... – Dr DR Nov 01 '16 at 09:21
  • Could it be that MongoDB requires that the insert / query match the index's order of fields? That is unfortunate to say the least!! – Dr DR Nov 01 '16 at 09:22
  • No, field order doesn't matter when inserting documents, and it will give you an error when you try to insert a document with duplicate keys. You could try it with a small subset. – HoefMeistert Nov 01 '16 at 09:26

1 Answer


_id is unique within a collection in MongoDB.

Duplicate _id values can occur in a sharded environment.

To avoid duplicates, use the {unique: true} option so that the underlying index enforces uniqueness; across shards this is only supported when the unique index contains the full shard key as a prefix.

If the "unique: true" option is not used, the shard key does not have to be unique.

References:

https://docs.mongodb.com/manual/sharding/

Duplicate documents on _id (in mongo)

Read this post for more information on duplicate _id values across different collections.

Clement Amarnath
  • Yes, but I was referring to the duplicate compound field values I've found, despite the fact that those fields form a compound index with a unique constraint. I don't understand how this is possible, but I can't see where I've gone wrong in my analysis of the situation. – Dr DR Nov 02 '16 at 08:20