
In short: if you have a large number of documents of widely varying sizes, where relatively few documents hit the maximum object size, what are the best practices for storing those documents in MongoDB?

I have a set of documents like this:

    {
      _id: ...,
      values: [12, 13, 434, 5555, ...]
    }

The length of the values list varies hugely from one document to another. For the majority of documents it will have a few elements; for a few it will have tens of millions of elements, and I will hit the maximum object size limit in MongoDB. The trouble is that any special solution I come up with for those very large (and relatively few) documents might have an impact on how I store the small documents, which would otherwise live happily in a MongoDB collection.

As far as I can see, I have the following options. I would appreciate any input on the pros and cons of these, and any other option that I missed.

1) Use another datastore: That seems too drastic. I like MongoDB, and it's not like I hit the size limit for many objects. In the worst case, my application could treat the very large objects and the rest differently. It just doesn't seem elegant.

2) Use GridFS to store the values: Like a blob in a traditional DB, I could keep the first few thousand elements of values in the document, and if there are more elements in the list, I could keep the rest in a GridFS object as a binary file. I wouldn't be able to search in this part, but I can live with that (a rough sketch of this is below the list).

3) Abuse GridFS: I could keep every document in GridFS. For the majority of the (small) documents, the binary chunk would be empty because the files collection would be able to keep everything. For the rest, I could keep the excess elements in the chunks collection. Does that introduce an overhead compared to option #2?

4) Really abuse GridFS: I could use the optional fields in the files collection of GridFS to store all elements of values. Does GridFS do smart chunking also for the files collection?

5) Use an additional "relational" collection to store the one-to-many relation, but the number of documents in this collection would easily exceed a hundred billion (also sketched below).
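
To make option #2 concrete, here is a rough, untested sketch of what I have in mind, using pymongo and gridfs; the INLINE_LIMIT cutoff, the docs collection and the values_file_id field are just placeholder names:

    # Rough, untested sketch for option #2 (pymongo + gridfs).
    # INLINE_LIMIT, "docs" and "values_file_id" are placeholder names.
    import struct
    import gridfs
    from pymongo import MongoClient

    db = MongoClient().mydb
    fs = gridfs.GridFS(db)

    INLINE_LIMIT = 10000  # hypothetical cutoff for keeping values inline

    def save_doc(doc_id, values):
        inline, overflow = values[:INLINE_LIMIT], values[INLINE_LIMIT:]
        doc = {"_id": doc_id, "values": inline}
        if overflow:
            # Excess elements go into GridFS as a packed binary blob;
            # this part is no longer queryable.
            blob = struct.pack("<%dq" % len(overflow), *overflow)
            doc["values_file_id"] = fs.put(blob, filename=str(doc_id))
        db.docs.replace_one({"_id": doc_id}, doc, upsert=True)

    def load_values(doc_id):
        doc = db.docs.find_one({"_id": doc_id})
        values = list(doc["values"])
        if "values_file_id" in doc:
            blob = fs.get(doc["values_file_id"]).read()
            values += list(struct.unpack("<%dq" % (len(blob) // 8), blob))
        return values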

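And a similarly rough, untested sketch of option #5, where each element becomes its own small document in a side collection (the doc_values collection and the seq/v field names are made up):

    # Rough, untested sketch for option #5 (pymongo).
    # "doc_values", "seq" and "v" are placeholder names.
    from pymongo import MongoClient, ASCENDING

    db = MongoClient().mydb

    # Compound index so one parent's values can be read back in order.
    db.doc_values.create_index([("doc_id", ASCENDING), ("seq", ASCENDING)])

    def save_values(doc_id, values):
        db.doc_values.delete_many({"doc_id": doc_id})
        # For tens of millions of elements this insert would need batching.
        db.doc_values.insert_many(
            [{"doc_id": doc_id, "seq": i, "v": v} for i, v in enumerate(values)],
            ordered=False)

    def load_values(doc_id):
        cursor = db.doc_values.find({"doc_id": doc_id}).sort("seq", ASCENDING)
        return [d["v"] for d in cursor]
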
asked by Ruggiero Spearman
  • Do you need to query these optional fields in any way? – Thilo Jun 25 '12 at 02:00
  • "Does GridFS do smart chunking also for the files collection?" No. The file metadata has to fit into a single BSON document. – Thilo Jun 25 '12 at 02:01
  • What kind of atomicity do you need for updates/inserts? – Thilo Jun 25 '12 at 02:05
  • Thanks for the comments Thilo. 1) I'd like to be able to query those optional fields, but I can give up on this requirement. 2) Thanks, that's what I suspected. 3) Atomicity is not critical, I can handle that at the application layer -- for instance, manually chunking the large documents and keeping them as two or three regular objects is an option. – Ruggiero Spearman Jun 25 '12 at 15:48
  • Just to add that there are a few details here - http://www.mongodb.org/display/DOCS/When+to+use+GridFS - on when and when not to use GridFS. If you don't need to query, then GridFS should be good in your scenario. – Mark Hillick Jun 27 '12 at 14:22
  • Thanks Mark. My concern is the overhead that might be introduced by GridFS. The majority of my objects won't need to be stored in GridFS. That's why I'm trying to understand the details of the trade-offs. – Ruggiero Spearman Jun 27 '12 at 15:23
  • No problem, the trade-off is that they're no longer documents, but binary blobs, so you've no way of querying for anything inside them. 16mb as a size limit for a document is quite a lot of data (I think "War & Peace" is under 500kb), so if you're concerned about losing functionality, maybe split your larger documents up. – Mark Hillick Jun 28 '12 at 15:10
  • Are all the values unique? If not, must you represent them as an array? For example, if a specific number repeats many times, you can store it as {_id: ..., values: {'1': <# occurrences>, '5': <# occurrences>, etc...}} – Meny Issakov Aug 17 '14 at 14:23
  • You can query the files collection as if it were an ordinary collection. I do that myself and I have indexes to enhance searching by date, folder and owner of attachment. The only inconvenience with this approach is that you'll need a way to discriminate when to fill the actual GridFS attachment and when you can just use the metadata. This same rule would need to be applied when retrieving such data, but in that case you could just query the attachment size. – ffflabs Aug 18 '14 at 12:23
  • It really depends on how you need to access the documents and what the documents are. Context here is what you need to leverage. An idea that pops to my mind is to compress the arrays in one way or another, for instance as Meny has proposed. Another idea is to split large documents into many smaller ones. Here, at the application level, you will need to reunite the documents to create what you want, but you could leverage the pipeline capability to query the data at will. – user983716 Feb 26 '17 at 17:42
  • "the number of documents in this collection would easily exceed a hundred billion rows" - would that really be a problem? MongoDB is designed to cope well with a collection having many, many documents. – Vince Bowdren Mar 23 '17 at 15:49

1 Answer


If you have large documents, try to store some metadata about them in MongoDB, and put the rest of the data (the part you will not be querying on) outside.
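
A minimal, untested sketch of that split, assuming pymongo/gridfs; the summary fields (count, min, max) and the payload_id field are only examples of "metadata you query on":

    # Minimal, untested sketch: queryable summary in the regular collection,
    # full payload stored outside (GridFS here, but any blob store works).
    import json
    import gridfs
    from pymongo import MongoClient

    db = MongoClient().mydb
    fs = gridfs.GridFS(db)

    def save_doc(doc_id, values):
        meta = {"_id": doc_id,
                "count": len(values),
                "min": min(values) if values else None,
                "max": max(values) if values else None}
        meta["payload_id"] = fs.put(json.dumps(values).encode("utf-8"))
        db.docs.replace_one({"_id": doc_id}, meta, upsert=True)

    # Queries then touch only the small metadata documents, e.g.:
    # db.docs.find({"count": {"$gt": 1000000}})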

answered by arboreal84