I am trying to understand the internal allocation and placement of arrays and hashes (which, from my understanding, are implemented through arrays) in MongoDB documents.

In our domain we have documents with anywhere between thousands and hundreds of thousands of key-value pairs, in logical groupings up to 5-6 levels deep (think nested hashes).

We represent the nesting in the keys with a dot, e.g., x.y.z, which upon insertion into MongoDB will automatically become something like:

{
    "_id" : "whatever",
    "x" : {
        "y" : {
            "z" : 5
        }
    }
}

The most common operation is incrementing a value, which we do with an atomic $inc, usually 1000+ values at a time with a single update command. New keys are added over time but not frequently, say, 100 times/day.
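To make the update shape concrete, here is a minimal sketch of how one batched `$inc` update document could be assembled from many counter paths. The helper name `build_inc_update` and the sample paths are hypothetical, made up for illustration; only the `$inc` operator and the dotted-path convention come from the description above.

```python
def build_inc_update(increments, delimiter="."):
    """Build one {"$inc": {...}} update document from (path, amount) pairs.

    `path` is a tuple of key segments, e.g. ("x", "y", "z").
    Repeated paths are summed so that each field appears only once.
    """
    inc = {}
    for path, amount in increments:
        key = delimiter.join(path)
        inc[key] = inc.get(key, 0) + amount
    return {"$inc": inc}


# Hypothetical sample data: three increments touching two distinct fields.
update = build_inc_update([
    (("x", "y", "z"), 1),
    (("x", "y", "w"), 2),
    (("x", "y", "z"), 4),
])
print(update)  # {'$inc': {'x.y.z': 5, 'x.y.w': 2}}

# With PyMongo this would be sent as a single command, for example:
#   collection.update_one({"_id": "whatever"}, update)
```

Passing `delimiter="-"` produces the keys for the flat layout instead, so the same helper can be used to compare both schemas.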

It occurred to me that an alternative representation would be to not use dots in names but some other delimiter and create a flat document, e.g.,

{
    "_id" : "whatever",
    "x-y-z" : 5
}
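For experimenting with both layouts, a small sketch that converts between the nested and the flat representation may help. The function names `flatten` and `unflatten` are hypothetical, and the sketch assumes that no key segment contains the delimiter itself and that empty sub-documents do not need to be preserved.

```python
def flatten(doc, delimiter="-", prefix=""):
    """Collapse nested dicts into a single level, joining keys with `delimiter`."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the sub-document, carrying the joined prefix along.
            flat.update(flatten(value, delimiter, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat, delimiter="-"):
    """Rebuild the nested form from delimiter-joined keys."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(delimiter)
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


nested_doc = {"_id": "whatever", "x": {"y": {"z": 5}}}
flat_doc = flatten(nested_doc)
print(flat_doc)             # {'_id': 'whatever', 'x-y-z': 5}
print(unflatten(flat_doc))  # {'_id': 'whatever', 'x': {'y': {'z': 5}}}
```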

Given the number of key-value pairs and the usage pattern in terms of $inc updates and new key insertion, I am looking for guidance on the trade-offs between the two approaches in terms of:

  • space overhead on disk

  • performance of $inc updates

  • performance of new key inserts

Sim

1 Answer

Documents in MongoDB are stored on disk in BSON format. A detailed description of the format is available at http://bsonspec.org/#/specification

While there is some disk savings from using short key names (since, as you can see by looking at the spec, the key name is embedded in the document), it looks to me like there'd be almost no net difference between the two designs in terms of on-disk space used -- the extra bytes you use by using the delimiters (-) get bought back by not having to have string terminators for the separate key values.
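The size question can be checked directly rather than estimated. The sketch below computes the encoded size of a document using the element layout from the BSON spec (a 4-byte length prefix and a trailing zero byte per document; a type byte plus a zero-terminated key per element). The function name `bson_size` is hypothetical, and the sketch only handles the value types used in this question: strings, integers, and nested documents.

```python
def bson_size(doc):
    """Return the BSON-encoded size in bytes of a dict of str/int/dict values."""
    size = 4 + 1  # int32 total-length prefix + trailing 0x00 terminator
    for key, value in doc.items():
        size += 1 + len(key.encode("utf-8")) + 1  # type byte + key as cstring
        if isinstance(value, dict):
            size += bson_size(value)  # embedded document
        elif isinstance(value, bool):
            raise TypeError("bool is not handled in this sketch")
        elif isinstance(value, int):
            # int32 if the value fits, otherwise int64
            size += 4 if -2**31 <= value < 2**31 else 8
        elif isinstance(value, str):
            # int32 byte count + UTF-8 bytes + trailing 0x00
            size += 4 + len(value.encode("utf-8")) + 1
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    return size


nested_doc = {"_id": "whatever", "x": {"y": {"z": 5}}}
flat_doc = {"_id": "whatever", "x-y-z": 5}
print(bson_size(nested_doc))  # 46
print(bson_size(flat_doc))    # 34
```

Each nesting level pays its fixed overhead once, while flat keys repeat the full prefix on every leaf, so the balance shifts as more leaves share a prefix. Running the function over representative real documents is the reliable way to see which layout is smaller for a given data set.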

$inc updates should take almost identical times with both formats, since they're both going to be in-memory operations. Any improvements in in-memory update time are going to be the tiniest of rounding errors compared to the time taken to read the document off of disk.

The performance of new key inserts should also be virtually identical. If adding the new key/value pair leaves the new document small enough to fit in the old location on disk, then all that happens is the in-memory version is updated and a journal entry gets written. Eventually, the in-memory version will be written to disk.

New key inserts are more problematic if the document grows beyond the space previously allocated for it. In that case, the server must move the document to a new location and update all indexes pointing to that document. This is generally a slower operation and should be avoided. However, the schema changes that you're discussing shouldn't affect the frequency of document movement. Again, I think this is a wash.

My suggestion would be to use the schema that most lends itself to developer productivity. If you're having performance problems, then you can ask separate questions about how you can either scale your system or improve performance, or both.

William Z
  • Reading the BSON spec it looks like arrays and documents cannot have any padding for future use. Do you read it the same way? This seems a bit strange: to add a single key in a 100K doc they may need to modify many blocks on disk as up to 100K of data may need to be moved by a few bytes. – Sim Sep 03 '12 at 01:51
  • You're right about the spec. MongoDB can allocate additional space for the document (a padding factor) above what the spec allows: http://www.mongodb.org/display/DOCS/Padding+Factor In addition, you can use a manual padding factor when you initially create the document: http://www.mongodb.org/display/DOCS/Padding+Factor#PaddingFactor-ManualPadding – William Z Sep 03 '12 at 14:25
  • In MongoDB, when a document outgrows its slot, only that document is moved: the documents around it are left untouched. When a document is moved, it is moved into a new record which is big enough to hold its new size (plus any padding factor). The extra I/O comes from re-indexing, not from moving other documents. – William Z Sep 03 '12 at 14:27
  • So it is as I feared... My comment was not about moving documents but modifying blocks on disk. In my case, documents occupy many blocks on disk. With a BSON encoding format that has no internal padding, when a document changes you may have to update all blocks on disk. That's pretty expensive from an I/O standpoint. A slightly better approach would be to come up with an encoding format that can be tuned to the block size on the storage device and that can add padding occasionally to minimize the likelihood of multi-block updates, e.g., when an array element is removed via $pull. – Sim Sep 04 '12 at 00:35
  • Since most applications are read-heavy, the design of MongoDB is read-optimized. In terms of I/O, it's faster to read a document if the entire document is in one contiguous location on the disk: the design of MongoDB reflects this. The price of this is slightly higher I/O requirements for those writes that grow the document. This design seems to work well in practice. Cf. Amdahl's Law and Donald Knuth on premature optimization. – William Z Sep 04 '12 at 14:04
  • You get an up vote for channeling grandpa Knuth. In my case, there is nothing premature, though. These updates are the slowest part of the system and they consume a big portion of the time. Also, whether there is padding or not *within* the document encoding as opposed to just at the end is a completely separate issue from whether the document is stored contiguously on disk. Your point is about eliminating random disk access. My point was about potentially unnecessary block updates on disk. That's apples & oranges. – Sim Sep 04 '12 at 22:43
  • If you have a performance problem, I invite you to post a separate SO question about that problem, so that we can diagnose the cause and propose a remedy. Slow performance is usually handled by good schema design, query optimization, or system tuning. – William Z Sep 05 '12 at 19:37