
I've been reading a lot of best practices about how I should embrace the _id. To be honest, I'm getting kind of paranoid about the repercussions I might face when I start scaling up my application if I don't do this.

Currently I have about 50k documents per database, and that's after only a few months of heavy usage. I expect this to grow A LOT. I do a lot of .find() Mango queries with not much indexing and, to be honest, I'm working with a relational-style document structure.

For example:

  • First get Project from ID.
  • Then do a find query that:
    • grabs all type:signature where project_id: X.
    • grabs all type:revision where project_id: X.
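
As a rough sketch, the two lookups above could be written as request bodies for `POST /{db}/_find`. The helper name `findByType` is made up for illustration, and the combined `$in` query is an assumption about how the two calls might be merged, not something I'm doing today.

```javascript
// Hypothetical helper: builds the body for POST /{db}/_find.
function findByType(projectId, type) {
  return { selector: { type: type, project_id: projectId } };
}

// The two separate queries from the list above ("X" stands for the project id).
const signatures = findByType("X", "signature");
const revisions = findByType("X", "revision");

// Assumption: both types could also be fetched in one request with $in.
const both = {
  selector: { project_id: "X", type: { $in: ["signature", "revision"] } },
};

console.log(JSON.stringify(signatures));
console.log(JSON.stringify(both));
```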

The reason for this is that I try VERY hard not to update documents. A lot of these documents are created offline, so a write-once workflow is very important for me to avoid conflicts.

I'm currently at a point of no return, as scheduling is getting pretty intense. If I want to change the way I'm doing things, now is the best time, before it gets too crazy.

I'd love to hear your thoughts about using the _id for data structuring and what people think.

Being able to make one call with a _all_docs grab like this sounds appealing to me:

{
  "include_docs": true,
  "startkey": "project:{ID}",
  "endkey": "project:{ID}:\ufff0"
}
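
A minimal sketch of how those parameters might be built and sent as a query string to `GET /{db}/_all_docs`. The helper names `projectRange` and `toQueryString` and the id `abc123` are made up for illustration.

```javascript
// Hypothetical helper: query parameters for GET /{db}/_all_docs.
function projectRange(projectId) {
  const prefix = `project:${projectId}`;
  return {
    include_docs: true,
    startkey: prefix,
    // \ufff0 is a high code point, used here so the range ends after
    // the ids that share this prefix.
    endkey: `${prefix}:\ufff0`,
  };
}

// Key values have to be JSON-encoded when they travel in the query string.
function toQueryString(params) {
  return Object.entries(params)
    .map(([k, v]) => `${k}=${encodeURIComponent(JSON.stringify(v))}`)
    .join("&");
}

console.log(toQueryString(projectRange("abc123")));
```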

Here is an example of how ONE type of my documents is currently set up:

Main Document

{
    _id: {COUCH_GENERATED_1},
    type: "project",
    ..
    .
}

Signature Document

{
    _id: {COUCH_GENERATED_2},
    type: "signature",
    project_id: {COUCH_GENERATED_1},
    created_at: {UNIX_TIMESTAMP}
}

Change to Main Document

{
    _id: {COUCH_GENERATED_3},
    type: "revision",
    project_id: {COUCH_GENERATED_1},
    created_at: {UNIX_TIMESTAMP},
    data: [{..}]
}

I was wondering whether I should do something like this:

Main Document: _id: project:{kuuid_1}

Signature Document: _id: project:{kuuid_1}:signature:{kuuid_2}

Change to Main Document: _id: project:{kuuid_1}:rev:{kuuid_3}
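
To make the idea concrete, here is a small sketch of helpers that would build these ids. The function names are hypothetical, and the kuuid values are passed in as plain strings rather than generated.

```javascript
// Hypothetical helpers for the proposed _id layout.
// Assumption: the kuuid values themselves never contain a ':' character.
function makeProjectId(projectKuuid) {
  return `project:${projectKuuid}`;
}

function makeSignatureId(projectKuuid, signatureKuuid) {
  return `${makeProjectId(projectKuuid)}:signature:${signatureKuuid}`;
}

function makeRevisionId(projectKuuid, revisionKuuid) {
  return `${makeProjectId(projectKuuid)}:rev:${revisionKuuid}`;
}

// Placeholder values standing in for real kuuids.
console.log(makeProjectId("k1"));
console.log(makeSignatureId("k1", "k2"));
console.log(makeRevisionId("k1", "k3"));
```

Because every child id starts with the parent's id, all documents for one project sit next to each other in the primary index.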

I'm just trying to set up my database in a way that isn't going to mess with me in the future. I know problems are going to come up but I'd like not to heavily change the structure if I can avoid it.

Another reason I am thinking of this is that I watch for _changes in my databases, and being able to know which types are coming through without fetching each document every time one changes sounds appealing too.
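
Something like the following is what I have in mind for the _changes case: a sketch that reads the type straight from the id of a change row. `typeFromId` is a made-up name, the sample row values are invented, and it assumes the id scheme proposed above with no ':' inside the kuuids.

```javascript
// Hypothetical parser: derives the document type from an _id such as
// "project:{kuuid_1}:signature:{kuuid_2}" without fetching the document.
function typeFromId(id) {
  const parts = id.split(":");
  if (parts[0] !== "project") return null;
  if (parts.length === 2) return "project";
  if (parts.length === 4) return parts[2] === "rev" ? "revision" : parts[2];
  return null; // id does not follow the scheme
}

// A row shaped like an entry in the _changes feed (values are made up).
const change = {
  seq: "1-abc",
  id: "project:p1:signature:s1",
  changes: [{ rev: "1-def" }],
};

console.log(typeFromId(change.id));
```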

halfer
bryan

1 Answer


Setting up your database structure so that it makes data retrieval easier is good practice. It seems to me you have some options:

  1. If the documents of interest have a field called project_id, you can create an index on project_id, which lets you fetch all documents for a known project_id cheaply. See CouchDB Find.
  2. Create a MapReduce index keyed on project_id, e.g. if (doc.project_id) { emit(doc.project_id) }. The resulting index lets you fetch documents by a known project_id with judicious use of start_key and end_key when querying the view. See Introduction to views.
  3. As you say, packing more information into the _id field allows you to perform range queries on the _all_docs endpoint.
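
As a rough sketch of the first two options, these are the kinds of request bodies involved. The index name `by-project-id` and the design document name `_design/projects` are invented for illustration; the map function is the one mentioned in option 2, with an explicit `null` value.

```javascript
// Option 1: body for POST /{db}/_index, a Mango JSON index on project_id.
const indexDefinition = {
  index: { fields: ["project_id"] },
  name: "by-project-id", // hypothetical index name
  type: "json",
};

// Option 2: a design document holding the view.
// CouchDB stores the map function as a string.
const designDoc = {
  _id: "_design/projects", // hypothetical design document name
  views: {
    by_project: {
      map: "function (doc) { if (doc.project_id) { emit(doc.project_id, null); } }",
    },
  },
};

// Query parameters for GET /{db}/_design/projects/_view/by_project.
// For an exact match on one project a single key is enough.
const viewQuery = { key: "X", include_docs: true };

console.log(JSON.stringify(indexDefinition));
console.log(JSON.stringify(designDoc));
console.log(JSON.stringify(viewQuery));
```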

If you choose a key design of:

project{project_id}:signature{kuuid}

then the primary index of the database has all of a single project's documents grouped together. Putting the project_id before the ':' character is preparation for a forthcoming CouchDB feature called "partitioned databases", which groups logically related documents in their own partition, making it quicker and easier to perform queries on a single partition (in your case, a project). This feature isn't ready yet, but it's likely to have a {partition_key}:{document_key} format for the _id field, so there's no harm in getting your document _ids ready for when it lands (see the CouchDB mailing list). In the meantime, a range query on _all_docs will work.
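
To illustrate that layout, here is a minimal sketch that splits an _id at the first ':' into the two parts described above. `splitPartition` and the sample id are hypothetical, and this only demonstrates the anticipated {partition_key}:{document_key} shape, not any CouchDB API.

```javascript
// Hypothetical helper: splits an _id at the first ':' into the partition
// key and document key that the anticipated format would use.
function splitPartition(id) {
  const i = id.indexOf(":");
  if (i === -1) return null; // id has no partition separator
  return { partitionKey: id.slice(0, i), documentKey: id.slice(i + 1) };
}

// Sample id following the project{project_id}:signature{kuuid} design.
console.log(splitPartition("projectP1:signatureS1"));
```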

Glynn Bird
  • Possibly relevant even if you are using CouchDB without using PouchDB: see also [`relational-pouch`](https://github.com/pouchdb-community/relational-pouch#how-does-it-work). Its source code will show something similar to the answer above, in context. – floer32 May 04 '21 at 20:42