named keys vs numerical keys - mongo

Question

Imagine we have a Mongo instance used solely for caching. The collection is simply an array dump of the following fields:

_id
key
value
expiration

However, someone on our project team stores it as

_id
0
1
2

Our backend (PHP) knows that 0 = "key", 1 = "value", 2 = "expiration". He said, "It's best to do it this way, so we aren't storing a long key name in every record in Mongo."

This originally made sense to me, since each document is stored on its own. However, using any management tool or trying to manipulate our data outside of our application is nearly impossible. It's like looking at 1's and 0's. So I set out to test this.

I made one small Mongo collection with named keys and another with numeric keys, then ran db.foo.stats() on both of them.

They matched every stat. So I guess my question is: if we have a key named VeryLongKeyDescriptiveText, and it's stored in 1000 records, is that the same physical size as storing 0 and the respective value? (My test says yes, but I don't understand how Mongo does this.)

My test uses two collections (control & test) with the two key-value setups above. Each collection currently has 3 documents that consist of a name, some base64 lorem ipsum text, and a Unix timestamp for expiration. Both collections have exactly the same data, except for the keys, which in the test collection are (0, 1, 2) instead of (key, value, expiration). Here are the outputs of stats() for both of them: http://pastebin.com/tTt7VzwQ

Possible duplicate: http://stackoverflow.com/questions/12790861/is-shortening-properties-names-worth-it — heinob, Jan 27 '14 at 13:29
@heinob The accepted answer in that question doesn't agree with my tests. — Connor Tumbleson, Jan 27 '14 at 13:31
could you please provide your testing strategy? (and db.foo.stats() output) — xlembouras, Jan 27 '14 at 13:39

score 3 · Answer 1 · answered Jan 27 '14 at 14:06

It is true that the difference in database size between the two options is usually blown out of proportion; in reality you might see only about 1MB of difference across 1,000 records between short and long field names.

That said, if you use long field names consistently and they are of considerable length, you will start to see real problems.

How noticeable the problem is also depends on document content size: if your documents are already big, you're not going to notice much of a change.

They matched every stat.

I would say that is luck more than short field names being the same size as long field names.

Are you sure the data is the same between the two except that one has short and one has long field names?

I really cannot see how it is physically possible that expiration would be the same size as 2, although I can see how key and 0 might be roughly the same number of bytes.
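The raw difference can be estimated by hand, because BSON writes each field name into every document as a NUL-terminated string. The following is only a rough sketch under stated assumptions: ASCII field names, flat documents whose values are strings or plain numbers (which the shell stores as 8-byte doubles), and no _id field since it would be identical in both layouts. The function name flatBsonSize and the sample values are made up for illustration.

```javascript
// Estimate the BSON size of a flat document (sketch, not a full encoder).
// BSON layout: int32 total length, the elements, then a 0x00 terminator.
// Each element: 1 type byte + NUL-terminated field name + value bytes.
function flatBsonSize(doc) {
  var size = 4 + 1; // int32 length prefix + trailing 0x00
  Object.keys(doc).forEach(function (name) {
    var value = doc[name];
    size += 1 + name.length + 1; // type byte + name + NUL (ASCII assumed)
    if (typeof value === "number") {
      size += 8; // plain JS numbers are stored as 8-byte doubles
    } else if (typeof value === "string") {
      size += 4 + value.length + 1; // int32 length + bytes + NUL (ASCII assumed)
    } else {
      throw new Error("unsupported value type in this sketch: " + typeof value);
    }
  });
  return size;
}

// Hypothetical sample data: same values, different key names.
var named = { key: "greeting", value: "aGVsbG8gd29ybGQ=", expiration: 1390829400 };
var numeric = { 0: "greeting", 1: "aGVsbG8gd29ybGQ=", 2: 1390829400 };

var namedSize = flatBsonSize(named);
var numericSize = flatBsonSize(numeric);
console.log(namedSize, numericSize, namedSize - numericSize);
```

With these key names the named layout is 15 bytes larger per document (18 bytes of names versus 3). A gap that small can plausibly be hidden by how records are allocated on disk, which may be why stats on a handful of small documents looked identical.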

The data is the same. I simply created the data in the collection with long field names, duplicated it, renamed the keys in the new collection, repaired and compacted the collections, then ran stats. — Connor Tumbleson, Jan 27 '14 at 14:34

score 2 · Accepted Answer · answered Feb 13 '14 at 18:13

If you expand your test case to create larger documents, the storage differences become more apparent. Consider the following:

// Build a document with j fields named "0" .. "j-1".
function createIntFields(j) {
  var document = {};
  for (var i = 0; i < j; i++) {
    document[i] = i;
  }
  return document;
}

// Build a document with j fields that share a long name prefix.
function createStringFields(j) {
  var document = {};
  for (var i = 0; i < j; i++) {
    document["thisIsAVeryLongFieldNamePrefix" + i] = i;
  }
  return document;
}

// Insert 1,000 documents holding 0 .. 999 fields into each collection.
db.int.drop();
for (var i = 0; i < 1000; i++) { db.int.insert(createIntFields(i)); }

db.string.drop();
for (var i = 0; i < 1000; i++) { db.string.insert(createStringFields(i)); }

The stats do differ quite a bit (I removed some irrelevant output fields):

> db.int.stats();
{
    "ns" : "test.int",
    "count" : 1000,
    "size" : 9395008,
    "avgObjSize" : 9395,
    "storageSize" : 11182080,
    "numExtents" : 6,
    "lastExtentSize" : 8388608
}
> db.string.stats();
{
    "ns" : "test.string",
    "count" : 1000,
    "size" : 32098752,
    "avgObjSize" : 32098,
    "storageSize" : 37797888,
    "numExtents" : 8,
    "lastExtentSize" : 15290368
}
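As a rough cross-check on those numbers, the bytes spent purely on field names can be counted directly. This is only a sketch: it assumes ASCII names, counts each name plus its NUL terminator as BSON stores it, and ignores values, _id, and any record allocation overhead. The helper name fieldNameBytes is hypothetical.

```javascript
// Total bytes spent on field names across a collection shaped like the test above,
// where document j holds j fields named prefix + "0" .. prefix + (j - 1).
function fieldNameBytes(docCount, prefix) {
  var total = 0;
  for (var j = 0; j < docCount; j++) {
    for (var i = 0; i < j; i++) {
      total += (prefix + i).length + 1; // name bytes + NUL (ASCII assumed)
    }
  }
  return total;
}

var PREFIX = "thisIsAVeryLongFieldNamePrefix";
var intNameBytes = fieldNameBytes(1000, "");
var stringNameBytes = fieldNameBytes(1000, PREFIX);
var extraNameBytes = stringNameBytes - intNameBytes;

console.log(intNameBytes, stringNameBytes, extraNameBytes);
```

The extra name bytes equal the prefix length times the 499,500 fields in the collection, a little under 15 MB, while the reported size values differ by about 22.7 MB. Field names would therefore account for roughly two thirds of the gap; the remainder is presumably allocation rounding, which this sketch does not model.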

To explain what you're seeing with small document sizes, we can refer to Mathias Stearn's storage internals presentation, specifically slide #25. Each record (i.e. a document in this case) has 16 bytes of overhead for the record length, extent, and next/prev pointers. In addition to that, the minimum payload for a document is 32 bytes. Therefore, even if we fill a collection with very tiny documents:

db.foo.drop();
for (var i = 0; i < 1000; i++) { db.foo.insert({_id:i}); }

The stats will show an average document size of 48:

> db.foo.stats()
{
    "ns" : "test.foo",
    "count" : 1000,
    "size" : 48032,
    "avgObjSize" : 48,
    "storageSize" : 172032,
    "numExtents" : 3,
    "lastExtentSize" : 131072
}

When a document payload surpasses 32 bytes, power-of-two allocation kicks in, so you may still see documents allocated in round chunks. In some of my tests, I noticed 112 was a common average size (96 + 16).

score 1 · Answer 3 · answered Jan 27 '14 at 13:32

My first thought was that they had implemented compression or tokenization of field names, but that issue still seems to be unresolved (as of Jan. 2014). They are probably the same size because of padding. The documents in your collection are padded for performance reasons, so they can often be resized in place without having to be moved around. You could try compacting the collection without any padding, to see if you see a difference now.

Mzzl

Hm, same size after compacting. I'll add some more documents to my test & control. Maybe it's too small to make any comparison. – Connor Tumbleson Jan 27 '14 at 13:36
