3

I sharded a collection in my MongoDB cluster on a hashed _id. When I checked the index sizes, I found an _id_hashed index that takes up a lot of space:

   "indexSizes" : {
           "_id_" : 14060169088,
           "_id_hashed" : 9549780576
    },

The MongoDB manual says that an index on the shard key is created when you shard a collection. I guess that is why the _id_hashed index is there.

My question is: what is the _id_hashed index for if I only query documents by the _id field? Can I delete it, since it takes up so much space?

PS: it seems MongoDB uses the _id index when querying, not the _id_hashed index. Here is the execution plan for a query:

   "clusteredType" : "ParallelSort",
    "shards" : {
            "rs1/192.168.62.168:27017,192.168.62.181:27017" : [
                    {
                            "cursor" : "BtreeCursor _id_",
                            "isMultiKey" : false,
                            "n" : 0,
                            "nscannedObjects" : 0,
                            "nscanned" : 1,
                            "nscannedObjectsAllPlans" : 0,
                            "nscannedAllPlans" : 1,
                            "scanAndOrder" : false,
                            "indexOnly" : false,
                            "nYields" : 0,
                            "nChunkSkips" : 0,
                            "millis" : 0,
                            "indexBounds" : {
                                    "start" : {
                                            "_id" : "spiderman_task_captainStatus_30491467_2387600"
                                    },
                                    "end" : {
                                            "_id" : "spiderman_task_captainStatus_30491467_2387600"
                                    }
                            },
                            "server" : "localhost:27017"
                    }
            ]
    },
    "cursor" : "BtreeCursor _id_",
    "n" : 0,
    "nChunkSkips" : 0,
    "nYields" : 0,
    "nscanned" : 1,
    "nscannedAllPlans" : 1,
    "nscannedObjects" : 0,
    "nscannedObjectsAllPlans" : 0,
    "millisShardTotal" : 0,
    "millisShardAvg" : 0,
    "numQueries" : 1,
    "numShards" : 1,
    "indexBounds" : {
            "start" : {
                    "_id" : "spiderman_task_captainStatus_30491467_2387600"
            },
            "end" : {
                    "_id" : "spiderman_task_captainStatus_30491467_2387600"
            }
    },
    "millis" : 574
Chris Martin
zach

4 Answers

3

MongoDB uses a range-based sharding approach. If you choose hash-based sharding, you must have a hashed index on the shard key, and you cannot drop it, since it is used to determine which shard to use for any subsequent queries. (Note that there is an open ticket, SERVER-8031, to allow you to drop the _id index once hashed indexes are allowed to be unique.)

As to why the query appears to be using the _id index rather than the _id_hashed index - I ran some tests and I think the optimizer is choosing the _id index because it is unique and results in a more efficient plan. You can see similar behavior if you shard on another key that has a pre-existing unique index.
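
One way to picture the two roles is the simplified sketch below. It assumes the router hashes the _id to pick a chunk, and that the chosen shard then resolves the document through its regular _id index, which would be consistent with the `BtreeCursor _id_` line in the explain output. The hash function, chunk boundaries, shard names and the `status` field are all made up for illustration; MongoDB's real hash and chunk metadata look different.

```javascript
// Stand-in 32-bit FNV-1a string hash.
// ASSUMPTION: illustrative only; MongoDB's real hashed-index function differs.
function hashKey(value) {
  let h = 0x811c9dc5;
  for (let i = 0; i < value.length; i++) {
    h ^= value.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep an unsigned 32-bit value
  }
  return h;
}

// Hypothetical chunk table: each chunk owns a range of HASH values, [min, max).
const chunks = [
  { min: 0x00000000, max: 0x40000000, shard: "shardA" },
  { min: 0x40000000, max: 0x80000000, shard: "shardB" },
  { min: 0x80000000, max: 0xc0000000, shard: "shardC" },
  { min: 0xc0000000, max: 2 ** 32, shard: "shardD" },
];

// Step 1 (router): hash the _id and find the chunk whose range contains it.
function targetShard(id) {
  const h = hashKey(id);
  return chunks.find((c) => h >= c.min && h < c.max).shard;
}

// Each shard keeps a Map keyed by the raw _id, standing in for its _id_ index.
const shardData = new Map();
for (const c of chunks) shardData.set(c.shard, new Map());

function insert(doc) {
  shardData.get(targetShard(doc._id)).set(doc._id, doc);
}

// Step 2 (shard): once targeted, the lookup only needs the raw _id.
function findById(id) {
  const shard = targetShard(id);
  const doc = shardData.get(shard).get(id) || null;
  return { shard, doc };
}

const id = "spiderman_task_captainStatus_30491467_2387600";
insert({ _id: id, status: "done" });
console.log(findById(id));
```

In this picture the hash only decides where a document lives; the per-shard lookup never has to consult an index ordered by hash value.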

jeffl
0

If you sharded on a hashed _id then that's the type of index that was created.

When you did sh.shardCollection( 'db.collection', { _id:"hashed" } ) you told it you wanted to use a hash of _id as the shard key which requires a hashed index on _id.

So, no, you cannot drop it.

Asya Kamsky
  • Yes, I found that in the MongoDB manual. What puzzles me is: what is this hashed index for? Can you explain more about how MongoDB uses this index? Thanks. – zach Sep 04 '13 at 03:40
  • If you want to use a hash of _id values for your shard key, MongoDB needs an index on the shard key (to look up ranges quickly), so it needs an index on the hashes of _id (in addition to the mandatory index on the actual _id values). – Asya Kamsky Sep 04 '13 at 03:48
  • Why does MongoDB need to look up ranges in the hashed key index? When migrating chunks to other shards? – zach Sep 04 '13 at 04:57
  • It's not specifically for ranges (I should have used a more general search example) but rather for any lookups. – Asya Kamsky Sep 04 '13 at 07:25
  • I only need to query on the _id field, and when I check the execution plan (see my original post), it seems the _id_hashed index is not used in that case. When and how will the index be used? – zach Sep 04 '13 at 08:28
0

The documentation explains in detail what a hashed index is, so I am puzzled that you have read it but don't know what the hashed index is for.

The index exists mainly to prevent hot spots with shard keys whose reads and writes are not evenly distributed.

Consider the _id field: it is an ever-increasing range, so every new _id sorts after all existing ones. This means you are always writing to the end of your cluster, which creates a hot spot.

As for reads, it is quite common to read only the newest documents. In that case only the upper range of the _id key is used, which creates a hot spot for both reads and writes in the upper range of the cluster while the rest of your cluster sits idle.

The hashed index takes this bad shard key and hashes it so that it is no longer ever increasing. Instead it produces an evenly distributed set of data for reads and writes, hopefully causing the entire cluster to be utilised for operations.

I would strongly recommend you do not delete it.
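
The hot spot effect can be sketched with a small simulation. It assumes numeric, ever-increasing ids and four chunks; the chunk boundaries are hypothetical and the multiplicative hash is only a stand-in, not MongoDB's actual hash function.

```javascript
// ASSUMPTION: Knuth-style multiplicative hash as a stand-in; MongoDB's real
// hashed-index function is different.
function hashKey(n) {
  return Math.imul(n, 2654435761) >>> 0; // unsigned 32-bit result
}

// Range sharding: four chunks split on the RAW key. The boundaries are
// hypothetical and were chosen when only ids 1..1000 existed.
const rangeBounds = [250, 500, 750];
function rangeChunk(id) {
  let i = 0;
  while (i < rangeBounds.length && id >= rangeBounds[i]) i++;
  return i; // chunk number 0..3
}

// Hashed sharding: four chunks split evenly on the HASH of the key.
function hashedChunk(id) {
  return Math.floor(hashKey(id) / 2 ** 30); // chunk number 0..3
}

// Insert 1000 new documents whose ids continue after the existing ones.
const rangeCounts = [0, 0, 0, 0];
const hashedCounts = [0, 0, 0, 0];
for (let id = 1001; id <= 2000; id++) {
  rangeCounts[rangeChunk(id)]++;
  hashedCounts[hashedChunk(id)]++;
}

console.log("range-sharded inserts per chunk:", rangeCounts);
console.log("hash-sharded inserts per chunk: ", hashedCounts);
```

With range sharding every new insert lands in the last chunk, while the hashed variant spreads the same inserts across all four chunks.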

Sammaye
  • for the hot spot thing, that is why I've used the hashed _id field as a sharding key. – zach Sep 04 '13 at 07:46
  • @zach so hang on what is your question? – Sammaye Sep 04 '13 at 07:48
  • Maybe you misunderstood my question. I understand the hot spot issue; that is why I used the hashed _id field as the shard key, not just the _id field. To put it another way: if I just use the _id field as the shard key, the situation is easy. I will have only an _id_ index, and MongoDB uses it to serve a query on the _id field. When I use the hashed _id as the shard key, MongoDB generates another index, _id_hashed. I don't understand when and how MongoDB uses this index. – zach Sep 04 '13 at 07:56
  • @zach MongoDB will use that index instead of the _id index, so it will actually use the hashed index completely, that _id_hashed IS the hashed index – Sammaye Sep 04 '13 at 07:58
  • I am afraid that is incorrect. "indexBounds" : { "start" : { "_id" : "spiderman_task_captainStatus_30491467_2387600" }, "end" : { "_id" : "spiderman_task_captainStatus_30491467_2387600" } }, – zach Sep 04 '13 at 08:05
  • check the execution plan in my original post. – zach Sep 04 '13 at 08:29
  • @zach After actually setting up a test environment for this, I believe explain works differently. If you run sh.status() you will see the actual bounds MongoDB uses internally. Well, actually no, you won't; I think the _id_hashed index is something not publicly shown. Asya would know more. – Sammaye Sep 04 '13 at 08:32
  • @zach Yeah after a lot more testing, it seems that even though the index is used MongoDB will never show it uses the index – Sammaye Sep 04 '13 at 10:10
  • how can you be sure the index was used? can you post your results somewhere? – zach Sep 04 '13 at 10:17
  • @zach I dunno, actually. Further tests have confused me: they show that a hashed index has no effect on lookups, data distribution, etc. In fact they show no change from the normal _id field at all. I have had to post a question on Google Groups about it, because this isn't anything like what the documentation says. – Sammaye Sep 04 '13 at 10:59
  • @zach Well, when a look up is performed there has to be a map between the hashed value and the real value, the query cannot know the real value of the hash, and the query will use the map to understand what shards it should target through the hashed shard key, that *should* be the hash index – Sammaye Sep 04 '13 at 14:22
  • I'll go check the Google group. In my opinion, the shard key is used to determine which shard a document belongs to; that doesn't have to mean that MongoDB uses the corresponding index to look up the document within that shard. In my case, once the shard is determined, the shard can use the _id_ index to find the document; it doesn't need to calculate the key's hash and use that value to search the _id_hashed index. I cannot find the internal indexing mechanism documented anywhere, so maybe we have to check the source code. – zach Sep 04 '13 at 14:26
  • @zach That is right, but once it is there it must use the hashed value for the balancer to work correctly; otherwise it will just rebalance back again using the old _id value. I believe the shard key must remain hashed, so the actual key used is a hash of the _id, which must also be used for querying in that case. But yeah, the Google group will say. – Sammaye Sep 04 '13 at 14:31
  • Why can't the query know the real value behind the hash? In my case, the real value is the _id field. The query already contains the _id and it is passed to the specific shard, right? Then why is the _id_hashed index needed? – zach Sep 04 '13 at 14:35
  • @zach Hang on I have just noticed, your explain is a scatter and gather operation, it isn't targeted. It cannot know the real values because hash is one way. A hash is not encryption. As such the _id_hashed index makes sense as a map between the two – Sammaye Sep 04 '13 at 14:37
  • @zach Oh, never mind, I misread "parallel"; it just refers to how the shards were accessed. – Sammaye Sep 04 '13 at 14:39
  • Let's wait and see if the people there can give a good explanation. I posted the same question on that group but haven't got a good answer. It's strange that your posts have to be checked there. – zach Sep 04 '13 at 14:41
  • I don't get why we need a reverse map from hash to _id. I've thought of one case when the _id_hashed index may be used: when the chunk splits into two. That's just a guess, need confirmation. – zach Sep 04 '13 at 15:10
  • @zach Yeah, I did some more tests for the Google group. It appears I was right: the hashed index is used for lookups and it is working. I just misread my status() output. – Sammaye Sep 05 '13 at 07:29
  • @zach In fact, Jeff's answer on Google Groups proves it is using the hashed key for lookups. – Sammaye Sep 05 '13 at 07:31
0

A hashed index is required by the sharded collection. More exactly, it is required by the sharding balancer to find documents directly by hash value. Normal query operations do not require the index to be a hashed index, even on a sharded collection.
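
A rough sketch of why an index ordered by hash value helps with that: if a chunk is a range of hash values, an index sorted by hash lets the shard pull out exactly the documents of one chunk as a contiguous slice. This is an illustration under assumptions, not MongoDB's implementation; the hash function, the ids and the chunk bounds are all hypothetical.

```javascript
// Stand-in 32-bit FNV-1a string hash.
// ASSUMPTION: illustrative only; MongoDB's real hashed-index function differs.
function hashKey(value) {
  let h = 0x811c9dc5;
  for (let i = 0; i < value.length; i++) {
    h ^= value.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// A toy "_id_hashed" index: entries sorted by hash value.
function buildHashedIndex(ids) {
  return ids
    .map((id) => ({ hash: hashKey(id), id }))
    .sort((a, b) => a.hash - b.hash);
}

// Binary search: first position whose hash is >= target.
function lowerBound(index, target) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (index[mid].hash < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// All ids whose hash falls in the chunk range [min, max), found without
// scanning the whole collection.
function idsInChunk(index, min, max) {
  return index
    .slice(lowerBound(index, min), lowerBound(index, max))
    .map((entry) => entry.id);
}

// Hypothetical data and chunk bounds.
const ids = Array.from({ length: 20 }, (_, i) => `task_${i}`);
const hashedIndex = buildHashedIndex(ids);
const chunkMin = 0x40000000;
const chunkMax = 0x80000000;
const moved = idsInChunk(hashedIndex, chunkMin, chunkMax);
console.log(`documents in chunk: ${moved.length} of ${ids.length}`);
```

An index ordered by the raw _id cannot answer this range question directly, because neighbouring _id values end up at unrelated hash positions.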

Joseph