
I have two sharded collections on 12 shards, with the same number of documents. The shard key of Collection1 is compound (two fields are used), and its documents consist of 4 fields. The shard key of Collection2 is a single field, and its documents consist of 5 fields.

Using the db.collection.stats() command, I get information about the indexes. What seems strange to me is that the total size of the _id index is 1342MB for Collection1 but 2224MB for Collection2. Is this difference reasonable? I expected the total sizes to be more or less the same, since both collections hold the same number of documents. Note that the shard key does not include the _id field in either collection.

Nicholas Kou
  • Maybe this is the reason: [collStats.totalIndexSize](https://docs.mongodb.com/v4.2/reference/command/collStats/#collStats.totalIndexSize) - "_... the returned size reflects the compressed size._" – prasad_ Aug 31 '20 at 08:56
  • @prasad_ yes it reflects the compressed size, but even in that case, wouldn't we expect similar sizes? – Nicholas Kou Aug 31 '20 at 09:12

1 Answer


MongoDB uses prefix compression for indexes.

This means that if sequential values in the index begin with the same series of bytes, those bytes are stored in full only for the first value; each subsequent value stores a tag indicating the length of the shared prefix, followed by its remaining bytes.

Depending on the data type of the _id values, the space saved this way can be substantial, so two indexes with the same number of entries can still differ in size.
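As a rough illustration of the idea, here is a minimal sketch of prefix compression over sorted string keys. It is a simplification and not WiredTiger's actual on-disk format; the function names `prefixCompress` and `storedLength`, the one-unit cost of the tag, and the sample keys are all assumptions made for the example.

```javascript
// Simplified prefix compression: each entry records how many leading
// characters it shares with the previous key, plus the remaining suffix.
// This is an illustration only, not WiredTiger's real format.
function prefixCompress(sortedKeys) {
  const entries = [];
  let previous = "";
  for (const key of sortedKeys) {
    // Count the leading characters shared with the previous key.
    let shared = 0;
    const limit = Math.min(key.length, previous.length);
    while (shared < limit && key[shared] === previous[shared]) shared++;
    entries.push({ prefixLength: shared, suffix: key.slice(shared) });
    previous = key;
  }
  return entries;
}

// Assumed cost model: one unit for the prefix-length tag plus the suffix.
function storedLength(entries) {
  return entries.reduce((total, e) => total + 1 + e.suffix.length, 0);
}

// Hypothetical sorted hex keys: the first set shares long prefixes,
// the second set shares none.
const clustered = ["5f4cb1a2000001", "5f4cb1a2000002", "5f4cb1a2000003", "5f4cb1a2000004"];
const scattered = ["1a9e07c3d41b52", "5f4cb1a2000002", "93b7e0d5126fa8", "c8d2f4416a0e9b"];

console.log(storedLength(prefixCompress(clustered))); // 21
console.log(storedLength(prefixCompress(scattered))); // 60
```

Both sets hold four keys of the same length, yet under this toy cost model the set whose neighbours share long prefixes needs roughly a third of the space.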

There may also be orphaned documents causing one node to have more entries in its _id index.

Joe
  • The _id values use the default type, ObjectId (displayed as a hexadecimal string). Could this happen because of how the data is distributed among the shards? – Nicholas Kou Aug 31 '20 at 12:36
  • That is entirely possible. You could check that by connecting directly to the primary of each shard and using `db.collection.count()` – Joe Sep 03 '20 at 03:38
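As a complement to counting on each shard directly, the per-shard figures can also be read from the `shards` section that `db.collection.stats()` returns for a sharded collection when run through mongos. The helper below is a hypothetical sketch (the name `summarizeShards` and the `bytesPerEntry` field are made up for illustration); it assumes each shard entry carries `count` and `indexSizes["_id_"]`, and that no scale argument was passed, so sizes are in bytes.

```javascript
// Summarize per-shard document counts and _id index sizes from the
// result of db.collection.stats() on a sharded collection.
// Assumption: stats.shards maps shard name -> { count, indexSizes }.
function summarizeShards(stats) {
  return Object.entries(stats.shards).map(([shard, s]) => ({
    shard,
    count: s.count,
    idIndexBytes: s.indexSizes["_id_"],
    // Average index bytes per document; 0 for an empty shard.
    bytesPerEntry: s.count > 0 ? s.indexSizes["_id_"] / s.count : 0,
  }));
}

// In the mongo shell, with the collection names from the question:
// printjson(summarizeShards(db.Collection1.stats()));
// printjson(summarizeShards(db.Collection2.stats()));
```

If the per-shard counts add up to more than the collection's logical document count, orphaned documents may be part of the explanation; if `bytesPerEntry` differs a lot between the two collections, compression is the more likely cause.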