
Should I store objects in an Array or inside an Object with top importance given Write Speed?


I'm trying to decide whether data should be stored as an array of objects, or using nested objects inside a mongodb document.

In this particular case, I'm keeping track of a set of continually updated files. I add new files and update existing ones; the file name acts as the key, and I store the number of lines processed in each file.

The document looks something like this:

{
  t_id: 1220,
  "some-other-info": {}, // there's other info here not updated frequently
  files: {
    "log1-txt": { filename: "log1.txt", numlines: 233, filesize: 19928 },
    "log2-txt": { filename: "log2.txt", numlines: 2, filesize: 843 }
  }
}

or this

{
  t_id: 1220,
  "some-other-info": {},
  files: [
    { filename: "log1.txt", numlines: 233, filesize: 19928 },
    { filename: "log2.txt", numlines: 2, filesize: 843 }
  ]
}

My assumption is that updates are easier to handle with nested objects, because an entry can be addressed directly by its key; with an array, I have to look through each element until I find the one with the matching file name.

Because the file names contain periods, I will need to convert (or drop) the periods to create a valid key (fi.le.log to filelog or fi-le-log). I'm not worried about different file names collapsing to the same key (such as fi.le.log and fi-le.log). I would prefer to use objects, because the number of files is relatively small but the updates are frequent.
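The key conversion described above could be sketched as follows. This is only an illustration: toFieldKey is a hypothetical helper name, and the choice of a hyphen as the replacement character simply follows the fi-le-log example.

```javascript
// Hypothetical helper: turn a file name into a key that can be used
// in a dotted update path such as "files.<key>".
function toFieldKey(filename) {
  return filename
    .replace(/\./g, "-")  // periods would be read as path separators
    .replace(/^\$+/, ""); // a leading "$" is treated specially in field names
}

console.log(toFieldKey("log1.txt"));  // log1-txt
console.log(toFieldKey("fi.le.log")); // fi-le-log
console.log(toFieldKey("fi-le.log")); // fi-le-log (same key as the line above)
```

The last two calls show the collision mentioned above: both names map to the same key, so this approach only fits when such duplicates do not matter.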

Or would it be better to handle this data in a separate collection for best write performance...

{
    "_id": ObjectId('56d9f1202d777d9806000003'),"t_id": "1220","filename": "log1.txt","filesize": 1843,"numlines": 554
},
{
    "_id": ObjectId('56d9f1392d777d9806000004'),"t_id": "1220","filename": "log2.txt","filesize": 5231,"numlines": 3027
}
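For this separate-collection option, each write could be a single upsert keyed on t_id and filename. The sketch below only builds the arguments as plain objects; buildFileUpsert is a hypothetical helper, and the collection name files used in the comments is an assumption.

```javascript
// Hypothetical helper: build the filter, update and options for one
// upsert into a per-file collection.
function buildFileUpsert(tId, file) {
  return {
    // One document per (t_id, filename) pair.
    filter: { t_id: tId, filename: file.filename },
    // Only the frequently changing counters are rewritten.
    update: { $set: { numlines: file.numlines, filesize: file.filesize } },
    // Insert the document if no match exists yet.
    options: { upsert: true },
  };
}

const upsertOp = buildFileUpsert("1220", {
  filename: "log1.txt",
  numlines: 554,
  filesize: 1843,
});

// In the mongo shell, assuming a collection named "files":
//   db.files.createIndex({ t_id: 1, filename: 1 }, { unique: true })
//   db.files.update(upsertOp.filter, upsertOp.update, upsertOp.options)
console.log(JSON.stringify(upsertOp, null, 2));
```

A compound index on t_id and filename keeps the lookup cheap; whether this beats the embedded layouts for a given workload is something to measure.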
Daniel
  • a quick test is worth lots of speculation... – dandavis Mar 04 '16 at 20:25
  • what does `t_id` signify? – AxxE Mar 04 '16 at 20:59
  • it's an ambiguous id, the significance is that there are multiple `t_id`s and each has multiple `file_name`s (1:m) – Daniel Mar 04 '16 at 21:12
  • This is really more a question of how you "read" the data than of "write" performance. Clearly, if you intend to "read multiple series" at once, then it's generally better to keep the data within the same collection object. If not, and particularly if there is more "create" than "update", then separate collection objects make much more sense from a "write" perspective. The general difference is reasonably negligible in terms of writing on modern engines, with separate documents giving you more concurrency thanks to document-level locking in WiredTiger. Choose according to your use case. And test, then test again – Blakes Seven Mar 05 '16 at 00:48

1 Answer

From what I understand, you are asking about write speed without any read considerations, so we have to think about how you will insert and update your document.

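We have to compare the following two updates (assuming you know the _id of the document you are updating; replace {key} with the key name, log1-txt or log2-txt in your example; for the array version, match on the file name itself, e.g. log1.txt):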

db.Col.update({ _id: '' }, { $set: { 'files.{key}': object }})

vs

db.Col.update({ _id: '', 'files.filename': '{key}'}, { $set: { 'files.$': object }})

The second one means that MongoDB has to scan the array, find the index of the matching element, and update it. The first one means MongoDB just updates the specified field.

Worse, the second command will not work if the matching filename is not present in the array! So you have to execute it, check whether nMatched is 0, and add the element if so. That's really bad for write speed (see MongoDB: upsert sub-document).
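To make the difference concrete, here is a rough sketch of both write paths expressed as plain filter/update objects. The helper names and the "doc-1" id are hypothetical; the $ne guard on the fallback push is one common way to avoid adding a duplicate element, not the only one.

```javascript
// Object layout: a single $set on a dotted path creates the entry
// if it is missing and overwrites it otherwise.
function objectLayoutUpdate(id, key, file) {
  return {
    filter: { _id: id },
    update: { $set: { ["files." + key]: file } },
  };
}

// Array layout: two candidate operations. The positional update only
// matches when the element already exists; the push is the fallback.
function arrayLayoutUpdates(id, file) {
  return {
    updateExisting: {
      filter: { _id: id, "files.filename": file.filename },
      update: { $set: { "files.$": file } },
    },
    pushIfMissing: {
      // $ne on an array field matches only if no element has this filename.
      filter: { _id: id, "files.filename": { $ne: file.filename } },
      update: { $push: { files: file } },
    },
  };
}

const sample = { filename: "log1.txt", numlines: 233, filesize: 19928 };
const objOp = objectLayoutUpdate("doc-1", "log1-txt", sample);
const arrOps = arrayLayoutUpdates("doc-1", sample);

// In the mongo shell this would be used roughly as:
//   db.Col.update(objOp.filter, objOp.update)              // one round trip
//   var res = db.Col.update(arrOps.updateExisting.filter,
//                           arrOps.updateExisting.update)
//   if (res.nMatched === 0) {                               // second round trip
//     db.Col.update(arrOps.pushIfMissing.filter, arrOps.pushIfMissing.update)
//   }
console.log(JSON.stringify(objOp));
console.log(JSON.stringify(arrOps));
```

The object layout always needs one command, while the array layout may need two whenever a new file shows up.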

If you will never (or almost never) run read queries or the aggregation framework on this collection, go for the first one; it will be faster. If you want to aggregate, unwind, or run analytics on the parsed files to get statistics about file sizes and line counts, consider the second one; it will save you some headaches.
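As an illustration of why the array layout is more convenient for analytics, here is a sketch of a pipeline that totals lines and sizes per t_id. The output field names (totalLines, totalSize, fileCount) are made up for this example, and summarize is a hypothetical plain-JavaScript stand-in that mirrors the $unwind/$group stages for one document so the result can be checked without a database.

```javascript
// Aggregation pipeline for the array layout.
const pipeline = [
  { $match: { t_id: 1220 } },
  { $unwind: "$files" }, // one output document per array element
  {
    $group: {
      _id: "$t_id",
      totalLines: { $sum: "$files.numlines" },
      totalSize: { $sum: "$files.filesize" },
      fileCount: { $sum: 1 },
    },
  },
];
// In the mongo shell: db.Col.aggregate(pipeline)

// Plain-JavaScript equivalent of the $unwind + $group stages for one document.
function summarize(doc) {
  return doc.files.reduce(
    (acc, f) => ({
      _id: acc._id,
      totalLines: acc.totalLines + f.numlines,
      totalSize: acc.totalSize + f.filesize,
      fileCount: acc.fileCount + 1,
    }),
    { _id: doc.t_id, totalLines: 0, totalSize: 0, fileCount: 0 }
  );
}

const exampleDoc = {
  t_id: 1220,
  files: [
    { filename: "log1.txt", numlines: 233, filesize: 19928 },
    { filename: "log2.txt", numlines: 2, filesize: 843 },
  ],
};
console.log(summarize(exampleDoc));
// { _id: 1220, totalLines: 235, totalSize: 20771, fileCount: 2 }
```

With the nested-object layout, the keys are part of the field paths, so $unwind cannot be applied directly; the files object would first have to be converted to an array (for example with $objectToArray on server versions that support it).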

Pure write speed will be better with the first solution.

Jonathan Muller
  • I agree with the answer. It's a trade-off between convenience and speed. The objects are of similar nature, so an array is *natural*, especially if you plan to do something with all the files. The write speed for inserts is not going to be very different for the two options. Even for updates, it's not going to be noticeable unless you have a large number of files (in which case, the third option sounds better to me). – user3392439 Jan 25 '17 at 08:10