31

We all know that for relational databases it is best practice to use numerical IDs for the primary key.

In CouchDB, the default ID that is generated is a UUID. Is it best to stick with the default, or to use an easily memorable identifier that the user will see in the application?

For example, if you were designing the stackoverflow.com database in CouchDB, would you use the question slug (e.g. what-is-best-practice-when-creating-document-ids-in-couchdb) or a UUID for each document?

andyuk

6 Answers

20

I'm no CouchDB expert, but after a little research, this is what I've found.

The simple answer is, use UUIDs unless you have a good reason not to.

The longer answer is, it depends on:

Cost of changing the ID vs. how likely the ID is to change

Low cost of changing and likely to change ID

An example of this might be a blog with a denormalized design, such as jchris' blog (Sofa, code available on GitHub).

Every time another website links to a blog post, this is another reference to the id, so the cost of changing the id increases.

High cost of changing ID and an ID that will never change

An example of this is any DB design that is highly normalized that uses auto-increment IDs. Stackoverflow.com is a good example with its auto-incrementing question IDs that you see in every URL. The cost of changing the ID is extremely high since every foreign key would need to be updated.

How many references, or "foreign keys" (in relational DB language) will there be to the id?

Any "foreign keys" will greatly increase the cost of changing the ID. Having to update other documents is a slow operation and definitely should be avoided.

How likely is the ID to change?

If you don't want to use UUIDs, you probably already have an idea of what ID you want to use.

If it is likely to change, the cost of changing the ID should be low. If it is not, pick a different ID.

What is your motivation for wanting to use an easily memorable ID?

Don't say performance.

Benchmarks show that "CouchDB’s view key lookups are almost, but not quite, as fast as direct document lookups". This means that having to do a search to find a record is no big deal. Don't choose friendly ids just because you can do a direct lookup on a document.

Will you be doing many bulk inserts?

If so, it is better to use incremental UUIDs for better performance.

See this post about bulk inserts. Damien Katz comments and says:

"If you want to have the fastest possible insert times, you should give the _id's ascending values, so get a UUID and increment it by 1, that way it's always inserting in the same place in the index, and being cache friendly once you are dealing with files larger than RAM. For an easier way to do the same thing, just sequentially number the documents but make it fixed length with padding so that they sort correctly, "0000001" instead of "1" for example."
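To make the quote concrete, here is a minimal Python sketch of both ideas: fixed-width, zero-padded sequence numbers, and a UUID that is incremented by 1 per document. The function names and the width of 7 are my own choices for illustration, not anything CouchDB prescribes, and the sort shown is Python's plain string sort rather than CouchDB's own collation.

```python
import uuid


def padded_ids(start, count, width=7):
    """Yield fixed-width, zero-padded sequential IDs such as '0000001'."""
    for n in range(start, start + count):
        yield str(n).zfill(width)


def incrementing_uuid_ids(count, base=None):
    """Start from one UUID and add 1 for each subsequent document."""
    base_int = (base or uuid.uuid4()).int
    for offset in range(count):
        # Mask to 128 bits so the result always fits in 32 hex digits.
        yield format((base_int + offset) % (1 << 128), "032x")


unpadded = [str(n) for n in (1, 2, 10)]
padded = list(padded_ids(1, 10))

print(sorted(unpadded))          # '10' sorts before '2' without padding
print(sorted(padded) == padded)  # padded IDs keep their numeric order
print(list(incrementing_uuid_ids(3)))
```

Both generators produce IDs that sort in the order they were created, which is the property the quote is after.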

andyuk
  • 6
    This answer seems predicated on the notion that conflict avoidance is always desirable; however, sometimes conflicts are a natural part of the problem domain, and rather than simply being avoided, they should be proactively detected and resolved. In such cases, a natural ID is an excellent choice. For example, don't use the title of a blog post as an ID on a massively multi-user system, but do use the fully qualified domain name and IP address when modeling DNS address records. – user359996 Oct 05 '10 at 06:21
  • 1
    This article well explains the impact of random UUIDs on CouchDB performance http://blog.inoi.fi/2010/11/impact-of-document-ids-on-performance.html – Lebugg Apr 29 '15 at 11:16
  • 1
    Having used CouchDB in various commercial and open source projects, I fully disagree with this answer. It completely disregards how IDs work in Couch (immutable, used for sorting, must be unique across the whole DB, significance for replication, etc.). – theDmi Aug 17 '16 at 14:05
  • 3
    I know this answer is old and was partially correct at the time. But now this is the opposite of what is recommended. You should NOT use UUIDs unless you have a good reason to use them. If you don't have a crazy number of docs being created in a multi-user environment, then new Date().toISOString() is a good default. – GifCo Mar 15 '17 at 20:17
  • @GifCo and what if the DB is written to in the same second? – shennan Dec 22 '17 at 09:22
  • I ended up mainly following [this advice](https://eager.io/blog/how-long-does-an-id-need-to-be). 4 bytes of timestamp accurate to the second. 3 random generated bytes. Both hashed using a custom base64 where '/' and '+' are replaced with [unreserved characters](https://tools.ietf.org/html/rfc3986#section-2.3) '-' and '.', as well as making the order of the characters match ascii ordering to play nicer with the b+tree balance (not sure why traditional base64 is ordered the way it is). – aaaaaa Apr 05 '18 at 04:17
17

Coming from a relational database point of view, it took me a while to figure out CouchDB. But the truth is the opposite of the accepted answer:

Instead of using a default uuid, generating a smart id can greatly assist you in retrieving and sorting data.

Say you have a database movies. All documents can be found somewhere under the URL /movies, but where exactly?

If you store a document with the _id Jabberwocky ({"_id":"Jabberwocky"}) into your movies database, it will be available under the URL /movies/Jabberwocky. So if you send a GET request to /movies/Jabberwocky, you will get back the JSON that makes up your document ({"_id":"Jabberwocky"}).

http://guide.couchdb.org/draft/documents.html

Performance tip: if you're just using the randomly-generated doc IDs, then you're not only missing out on an opportunity to get a free index – you're also incurring the overhead of building an index you're never going to use. So use and abuse your doc IDs!

https://pouchdb.com/2014/05/01/secondary-indexes-have-landed-in-pouchdb.html
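As a rough illustration of what a "smart" ID can look like, here is a Python sketch that builds an `_id` from a type, a UTC timestamp, and a unique suffix, plus the `startkey`/`endkey` pair you could pass to an `_all_docs` range query to fetch one type in time order. The `type_timestamp_uid` layout, the helper names, and the sample values are assumptions for the example. `"\ufff0"` is the high sentinel commonly used in CouchDB range queries, and the local `in_range` check only approximates CouchDB's ordering with Python's string comparison.

```python
from datetime import datetime, timezone

HIGH_SENTINEL = "\ufff0"  # sorts after ordinary ASCII characters


def make_id(doc_type, created, uid):
    """Build an ID like 'movie_1975-04-03T00:00:00Z_jabberwocky'."""
    stamp = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{doc_type}_{stamp}_{uid}"


def type_range(doc_type):
    """Query parameters covering every ID that starts with '<doc_type>_'."""
    prefix = f"{doc_type}_"
    return {"startkey": prefix, "endkey": prefix + HIGH_SENTINEL}


def in_range(doc_id, rng):
    """Local approximation of the range check the database would perform."""
    return rng["startkey"] <= doc_id <= rng["endkey"]


ids = [
    make_id("movie", datetime(1979, 8, 17, tzinfo=timezone.utc), "life-of-brian"),
    make_id("actor", datetime(1975, 1, 1, tzinfo=timezone.utc), "palin"),
    make_id("movie", datetime(1975, 4, 3, tzinfo=timezone.utc), "jabberwocky"),
]

movies = sorted(i for i in ids if in_range(i, type_range("movie")))
print(movies)
```

Because the timestamp is fixed-width and comes right after the type, sorting the IDs as strings also sorts each type chronologically, with no extra index needed.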

TimoSolo
  • 3
    This should be the accepted answer. If you cleverly utilize the `_id` for your queries, you don't need to create additional indices that decrease performance. For example, if you have documents of different `type` with a unique `uid` and additionally need to sort them via a timestamp, create an `_id` like `${type}_${timestamp}_${uid}`. – LyteFM Apr 11 '20 at 13:33
3

I realize this is a long-answered question, but there's another important consideration for those discovering the issue. When a document is deleted, all you know about it is the ID. Typing, whether explicit (type: "foo") or implied (duck typing), doesn't work. So you can't subscribe to changes for doc.deleted === true && doc.type === "foo", because after the delete, doc.type === undefined. An _id value that you can decode post hoc is useful, particularly if your client code needs to be otherwise stateless (and therefore can't cache a list of _ids by type).
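A small Python sketch of that idea, assuming IDs that begin with a `type_` prefix (my own convention for the example). The sample rows only mimic the general shape of `_changes` entries (`id`, `changes`, and `deleted` on removed documents); the values are made up.

```python
def doc_type_from_id(doc_id, separator="_"):
    """Return the type prefix of an ID like 'movie_...', or None if there is none."""
    prefix, sep, _rest = doc_id.partition(separator)
    return prefix if sep else None


def deleted_ids_of_type(changes, doc_type):
    """Pick out deletions of one type using nothing but the ID."""
    return [
        row["id"]
        for row in changes
        if row.get("deleted") and doc_type_from_id(row["id"]) == doc_type
    ]


# Made-up rows shaped like changes-feed entries; deleted docs carry no body fields.
sample_changes = [
    {"seq": 1, "id": "movie_jabberwocky", "changes": [{"rev": "2-aaa"}], "deleted": True},
    {"seq": 2, "id": "actor_palin", "changes": [{"rev": "2-bbb"}], "deleted": True},
    {"seq": 3, "id": "movie_brazil", "changes": [{"rev": "1-ccc"}]},
]

print(deleted_ids_of_type(sample_changes, "movie"))
```

The client stays stateless: it never needs the deleted document's body or a cached list of IDs per type.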

Peter O.
Jim
  • 1
    I realize this is an old response, but you can get around that by, instead of issuing a DELETE on the document, updating the document with a field `"_deleted": true` in the root. However, ensuring your code only uses this strategy would probably be painful and error-prone. – dhasenan Nov 25 '15 at 23:53
0

The _id is used a lot in the CouchDB internals and any extra hashing cost is going to slow down a bunch of the internals so it's best to stick with the UUID provided.

mikeal
  • 5
    I'm confused. What do you mean by "extra hashing cost"? Are you saying a user-generated ID will end up hashed, internally, whereas an auto-generated UUID will not? – user359996 Oct 05 '10 at 06:01
  • Might be referring to the length of an _id (higher cost to hash a longer string)? – Nevir Sep 24 '11 at 16:47
0

You could go with the default CouchDB ID (a UUID). According to the documentation, the main reasons to use the default UUID are as follows:

  • UUIDs are random numbers that have such a low collision probability that everybody can make thousands of UUIDs a minute for millions of years without ever creating a duplicate. This is a great way to ensure two independent people cannot create two different documents with the same ID.
  • CouchDB replication lets you share documents with others and using UUIDs ensures that it all works.

On the other hand, if you rely on the server (CouchDB) to generate the UUID and you end up making two POST requests because the first one bombed out, you might generate two docs and never find out about the first one, because only the second one will be reported back. So it's a good idea to generate your own UUIDs to make sure that you never end up with duplicate documents. Either way, I would definitely go with a UUID unless you specifically need otherwise.
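The retry problem can be sketched in Python without a server. `FakeDatabase` below is a hypothetical in-memory stand-in, not a CouchDB client; it only mimics two behaviours: a POST lets the server pick a fresh ID each time, while a PUT to an ID that already exists is rejected as a conflict.

```python
import uuid


class FakeDatabase:
    """In-memory stand-in used only for this illustration."""

    def __init__(self):
        self.docs = {}

    def post(self, doc):
        """Server picks the ID, so every call creates a new document."""
        doc_id = uuid.uuid4().hex
        self.docs[doc_id] = doc
        return doc_id

    def put(self, doc_id, doc):
        """Client picks the ID; creating the same ID twice is a conflict."""
        if doc_id in self.docs:
            return "conflict"
        self.docs[doc_id] = doc
        return "created"


# Case 1: the first POST response is lost, so the client blindly retries.
db_post = FakeDatabase()
for _ in range(2):
    db_post.post({"title": "Jabberwocky"})

# Case 2: the client generates the UUID once and reuses it on retry.
db_put = FakeDatabase()
doc_id = uuid.uuid4().hex
results = [db_put.put(doc_id, {"title": "Jabberwocky"}) for _ in range(2)]

print(len(db_post.docs), len(db_put.docs), results)
```

With a client-generated ID, the retry is harmless: the second attempt either succeeds or tells you the document is already there.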

Mike
-3

The primary key in a DB should never have any "meaning" except maybe to encode sequence. You might want to change the SLUG but not the primary key.

There might be a good argument for using something that starts with a timestamp, to get inherent ordering in your keys. I often use "%f@%s" % (time(), hostname()) to get ordered, unique keys. (This works only if your time() implementation never returns the same value twice.)

For other stuff (e.g. images), where I want to avoid duplicates, I often use sha(data) as the key.
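Spelled out as runnable Python, the two key schemes look roughly like this. The function names are mine, `socket.gethostname()` stands in for hostname(), and SHA-1 is an assumption, since the answer only says sha(data).

```python
import hashlib
import socket
import time


def ordered_key(now=None, host=None):
    """Timestamp-first key, e.g. '1700000000.123456@myhost'."""
    now = time.time() if now is None else now
    host = socket.gethostname() if host is None else host
    return "%f@%s" % (now, host)


def content_key(data):
    """Key derived from the content, so identical data maps to the same key."""
    return hashlib.sha1(data).hexdigest()


print(ordered_key())
print(content_key(b"same image bytes") == content_key(b"same image bytes"))
```

The optional `now` and `host` arguments are only there to make the function easy to test. Note that the timestamp part is not zero-padded, so string ordering matches time ordering only while the number of integer digits stays the same.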

max