11

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).

However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).

My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?

Update

For anyone else who may find this: the general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.

Sammaye provides an excellent answer here: Is it bad to change _id type in MongoDB to integer?

AJB
  • I can't think of one. Perhaps adding another "Key" you can build an index on to make lookups faster. – Matt Mar 05 '14 at 17:27
  • Related to: http://stackoverflow.com/questions/14054384/is-it-bad-to-change-id-type-in-mongodb-to-integer/14058189#14058189 – Sammaye Mar 05 '14 at 17:32
  • The only benefit I can see is that an Int64 is 8 bytes while a BSON ObjectId is 12 bytes, so you can save some space. – Sumeet Kumar Yadav Mar 05 '14 at 18:12
  • Thanks a tonne, folks. @Sammaye, very well explained in the link you provided. sumeet, thanks for the info, didn't even think of that really. matt, this is what I'll do when (if) I really need to assign my own IDs. – AJB Mar 05 '14 at 19:24

5 Answers

10

Advantages with generating your own _ids:

  • You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...

  • Or you can make them more human-friendly, using random strings: t3oSKd9q

    (That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)

  • If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard MongoDB ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)

  • Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)

Advantages to using ObjectIds:

  • ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.

  • ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
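To illustrate the creation-time point, here is a minimal Python sketch that reads the timestamp out of an ObjectId's hex string using only the standard library. The function name `objectid_creation_time` and the sample id are made up for this example; the only fact relied on is that the first 4 of the 12 bytes hold the seconds since the Unix epoch.

```python
from datetime import datetime, timezone


def objectid_creation_time(oid_hex: str) -> datetime:
    """Return the UTC creation time encoded in a 24-character hex ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    raw = bytes.fromhex(oid_hex)
    # The first 4 bytes are a big-endian count of seconds since the Unix epoch.
    seconds = int.from_bytes(raw[:4], byteorder="big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Sample value chosen for illustration only.
created = objectid_creation_time("507f1f77bcf86cd799439011")
print(created.isoformat())  # prints 2012-10-17T21:13:27+00:00
```

If you already use PyMongo, its `bson.ObjectId` class exposes the same information through the `generation_time` property, so you would not normally need to decode it by hand.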

The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.

Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
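For a rough feel of the length-versus-collision trade-off, here is a small Python sketch. It is not the nanoid API or my module, just the same idea built on the standard library; `random_id`, `collision_probability` and the 62-character alphabet are assumptions made for this example, and the probability uses the usual birthday-bound approximation.

```python
import math
import secrets
import string

# Assumed alphabet: upper case, lower case and digits (62 symbols).
ALPHABET = string.ascii_letters + string.digits


def random_id(length: int = 8) -> str:
    """Generate a random id from a cryptographically strong source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(num_ids: int, length: int = 8) -> float:
    """Approximate chance that at least two of num_ids random ids collide."""
    space = len(ALPHABET) ** length
    # Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2N)).
    exponent = num_ids * (num_ids - 1) / (2 * space)
    return -math.expm1(-exponent)


print(random_id())
print(collision_probability(1_000_000, length=8))
print(collision_probability(1_000_000, length=12))
```

Whatever length you pick, the unique index on _id will still reject a duplicate, so a collision shows up as a failed insert that you can retry with a fresh id rather than as silently overwritten data.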


Sharding strategies

Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.

I won't claim to be an expert on how best to shard data, but here are some situations we might consider:

  1. An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.

  2. You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)

  3. You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.

  4. Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.

    But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
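The difference between time-ordered and evenly spread keys can be seen in a toy model. The Python sketch below is not how MongoDB places chunks internally: the four shards, the split points and the modulo-based `hashed_shard` function are all simplifying assumptions, there only to show why ever-increasing keys pile onto one shard while hashed or random keys spread out.

```python
import hashlib
from bisect import bisect_right
from collections import Counter
from typing import Sequence

NUM_SHARDS = 4  # assumed cluster size for this toy model


def ranged_shard(key: int, split_points: Sequence[int]) -> int:
    """Pick a shard by key range: shard i holds keys below split_points[i]."""
    return bisect_right(split_points, key)


def hashed_shard(key: int) -> int:
    """Pick a shard from a hash of the key, ignoring the key's ordering."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Existing data covers keys 0..999, split evenly across four ranges.
split_points = [250, 500, 750]
# New inserts have ever-increasing keys, like timestamps or ObjectIds.
new_keys = range(1000, 2000)

ranged = Counter(ranged_shard(k, split_points) for k in new_keys)
hashed = Counter(hashed_shard(k) for k in new_keys)

print("ranged:", dict(ranged))  # every new key lands on the last shard
print("hashed:", dict(hashed))  # new keys are spread over all shards
```

In this model the ranged layout stays lopsided until the ranges are re-split, which is roughly the job the balancer does in the background.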

You may like to read the official docs on this subject:

joeytwiddle
  • Love this answer, want to accept it, but there's two points making my brain itch. 1) Why are you describing an 8-char random string as "more human-friendly"? 2) Why are you arguing that randomly-generated strings provide better shard distribution? That doesn't seem correct to me. An `_id` often serves as the shard key — grouping related docs by a common key. Why would random strings provide "better shard distribution"? – AJB Oct 26 '21 at 07:07
  • @AJB 1. My feeling is that a short alphanumeric string looks more attractive to humans, and in a pinch it can be typed more easily, than the longer hex string of an ObjectID. YouTube uses such strings to ID their videos. (Being case agnostic, perhaps showing all in uppercase, would be even easier for humans to copy, but sacrifices a lot of range. That's a compromise for the reader to consider.) 2. I have edited "better distribution" to "more even distribution" because that's more accurate. Grouping similar docs into one shard may or may not be desirable, depending on the application. – joeytwiddle Oct 26 '21 at 14:29
  • I have one question about human-friendly, using random strings. [Will rand string as custom _id make B tree index split higher frequency in MongoDB?](https://stackoverflow.com/questions/67732788/will-rand-string-as-custom-id-make-b-tree-index-split-higher-frequency-in-mongo?noredirect=1&lq=1) – zangw Jan 29 '22 at 11:55
6

I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.

Let me explain. The reason people might consider re-try logic: inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app would be more resilient and self-healing if it were coded to re-try rather than give up right away. This rides over odd blips that might occur, without the consumer ever being affected.

For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.

Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.

Although this sort of resiliency may be overkill in some types of projects, it really just depends.
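The lost-acknowledgement scenario can be sketched in Python without a real database. Everything here is hypothetical: `FakeCollection` is an in-memory stand-in that enforces a unique _id, and the local `DuplicateKeyError` class plays the role of the duplicate-key error a real driver raises (PyMongo's is `pymongo.errors.DuplicateKeyError`).

```python
import uuid


class DuplicateKeyError(Exception):
    """Stand-in for the duplicate-key error a real driver would raise."""


class FakeCollection:
    """Hypothetical in-memory collection that enforces a unique _id."""

    def __init__(self, lose_first_ack: bool = False):
        self.docs = {}
        self._lose_next_ack = lose_first_ack

    def insert_one(self, doc: dict) -> None:
        if doc["_id"] in self.docs:
            raise DuplicateKeyError(doc["_id"])
        self.docs[doc["_id"]] = dict(doc)
        if self._lose_next_ack:
            self._lose_next_ack = False
            # The write succeeded, but the client never hears about it.
            raise ConnectionError("acknowledgement lost")


def insert_with_retry(collection, doc: dict, attempts: int = 3) -> str:
    """Insert a document whose _id was generated up front, retrying on failure."""
    for _ in range(attempts):
        try:
            collection.insert_one(doc)
            return "inserted"
        except ConnectionError:
            continue  # outcome unknown, so try again with the same _id
        except DuplicateKeyError:
            return "already stored"  # an earlier attempt got through
    raise RuntimeError("gave up after %d attempts" % attempts)


articles = FakeCollection(lose_first_ack=True)
doc = {"_id": str(uuid.uuid4()), "title": "Custom ids"}
result = insert_with_retry(articles, doc)
print(result, len(articles.docs))  # prints: already stored 1
```

Treating a duplicate-key error as success is only safe because the _id was created for this one logical operation. If ids could be reused for different data, you would need to compare the stored document as well.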

7wp
  • Very good point @7wp. An edge-case for sure, but some robust thinking that may just help someone that needs to write code in this context. – AJB Oct 26 '21 at 06:56
5

Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses, so that the ID shown in the URL is much shorter.

Will Shaver
4

I have used custom ids a couple of times and it was quite useful.

In particular, I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can simplify your indexes: no extra index is needed, because the default index on _id is sufficient.
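As a rough illustration, here is a Python sketch with a plain dict standing in for the collection. The ISO `YYYY-MM-DD` format, the `stats_id` helper and the sample figures are assumptions for this example, not the exact format I used. The point is that such strings sort in chronological order, so both single-day lookups and date-range queries can run against the _id alone.

```python
from datetime import date, timedelta


def stats_id(day: date) -> str:
    """Build the _id for one day's stats, e.g. '2014-03-05' (assumed format)."""
    return day.isoformat()


# One document per day, keyed by its date-based _id (made-up numbers).
stats = {
    stats_id(date(2014, 3, 1) + timedelta(days=i)): {"visits": 100 + i}
    for i in range(10)
}


def stats_between(start: date, end: date) -> list:
    """Return the stats documents for start..end inclusive, oldest first."""
    lo, hi = stats_id(start), stats_id(end)
    # ISO date strings compare in the same order as the dates themselves.
    return [stats[key] for key in sorted(stats) if lo <= key <= hi]


print(stats[stats_id(date(2014, 3, 5))])  # prints {'visits': 104}
print(stats_between(date(2014, 3, 3), date(2014, 3, 5)))
```

This only works when there is exactly one document per date, as in a daily batch job. Several documents for the same date would collide on the _id.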

xlembouras
  • This seems like it might not hold up if your app reaches a high level of concurrency. Say, 1M operations per second, where the timestamp wouldn't be unique across documents because `n` documents are being inserted in the same millisecond. – AJB Mar 05 '14 at 19:33
  • valid point. In my case the application would populate a collection by a batch process (for a single date) and then the results would be presented on a web app. – xlembouras Mar 05 '14 at 19:41
0

I'll use an example. I created a property management tool that had multiple collections. For simplicity, some fields were duplicated, for example the payment. When I needed to update these records, the change had to happen simultaneously in every collection they appeared in, so I would assign them a custom payment id; that way, when a delete or query action is performed, it affects all instances of the record database-wide.

  • @dennise kinuthia, your above answer reads as though you created a `payment_id` field on your docs in order to act as a UUID for relational purposes. The origin question is really about the canonical `_id` field that is a requirement in MongoDB docs. – AJB Oct 26 '21 at 06:59
  • @Dennis kinuthia, I would encourage you to flush out your above answer a bit more, there may be important information in there, but it's currently not detailed enough (with struct/code examples for illustration purposes), but seems like there will be some interesting information in there. – AJB Oct 26 '21 at 07:01
  • @AJB Please check whether you really meant "flush out" (i.e. "get rid of") rather than "flesh out" (i.e. augment with more details), which I suspect is what you meant in an otherwise constructive comment. – Yunnosch Nov 08 '21 at 21:23