How should I design my data storage scheme using MongoDB?

Question

My data looks like this: I have about 1,000,000 gene sequences. Some of them are very short, only hundreds of characters, but some are so large that their BSON size already exceeds MongoDB's 16 MB per-document size limit, with up to about 10,000,000 characters in a single sequence. So I am considering using GridFS to store these sequences as files, which leaves me with a dilemma:

Solution 1: store all gene sequences as files using GridFS, no matter whether they are small or big.

Solution 2: store only the very large gene sequences as files using GridFS, and store the small ones as normal documents. But this leads to another problem: queries are no longer simple, because the gene sequences are stored in two different ways, so every query has to check both of them.

I am new to MongoDB, so many of my thoughts may look ridiculous, but I really need your help.

I would not use GridFS to store data; I would use it only for storing files. I would rather break up the huge character sequence into some meaningful document structures. (On GridFS binaries there are no indexes and no smart queries.) — attish, Sep 12 '13 at 07:30
Well, a gene sequence is just a string of characters without any meaning. Do you mean I should break them into several documents? Can you explain in more detail? You said "on gridfs binaries no index no smart queries". Of course a binary file itself cannot be queried or indexed, but as far as I know about GridFS, I can attach metadata to any file when I store it in MongoDB, and this metadata can be queried and indexed, right? Thanks. — wuchang, Sep 12 '13 at 07:37
Using only GridFS sounds like the better solution here. As you already noticed yourself, a two-way approach introduces a lot of additional complexity. That holds unless you need to search within these sequences; in that case, do what @attish wrote. — Philipp, Sep 12 '13 at 07:38
Sure, searching for these sequences is necessary, because someone may want to download a sequence. Before downloading it, they have to find it. — wuchang, Sep 12 '13 at 07:42
If you are likely to run queries based on the content of the gene sequence, you cannot do that when it is stored in GridFS. However, if you only need queries that find the right gene sequence as a whole, based on metadata that is independent of the sequence itself, then with GridFS you can store this metadata in the files collection, query it, and build indexes on it. For content-based queries you have to break the sequence up into smaller parts and store them as normal documents. — attish, Sep 12 '13 at 07:55
Regarding your comment: "if you just likely to run queries to find the right gene sequence as a whole based on some metadata which is independent from the sequence itself in gridFS you can store this metadata in the files collection and you can query this and can have indexes on." — wuchang, Sep 12 '13 at 07:59
Well, as you said, if I break these sequences into several documents, how can I link these documents together? — wuchang, Sep 12 '13 at 08:23
It totally depends on the data you store. If there is some logic by which you can say, for example, that every 10 bytes represent something that is meaningful individually, then that is the level of separation. — attish, Sep 12 '13 at 11:46
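
One possible way to link the pieces of a split sequence, sketched in plain Python without any database calls: give every piece the same sequence identifier plus an ordinal, so the pieces can be found together and put back in order. The field names (`seq_id`, `n`, `start`, `data`), the function names, and the fixed chunk size are all assumptions made for illustration; nothing in the thread prescribes them, and a fixed size is only a stand-in for whatever unit is meaningful in the data.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical chunk size; the thread does not suggest a value.
CHUNK_SIZE = 1000


def split_sequence(seq_id: str, sequence: str,
                   chunk_size: int = CHUNK_SIZE) -> Iterator[Dict]:
    """Yield one dict per chunk; all chunks share seq_id and carry an ordinal n."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for n, start in enumerate(range(0, len(sequence), chunk_size)):
        yield {
            "seq_id": seq_id,   # links the chunk to its parent sequence
            "n": n,             # position of the chunk within the sequence
            "start": start,     # offset of the first character of the chunk
            "data": sequence[start:start + chunk_size],
        }


def join_chunks(chunks: Iterable[Dict]) -> str:
    """Rebuild the full sequence from its chunks, in any input order."""
    return "".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))
```

Dicts of this shape could be stored as ordinary documents, and an index on `seq_id` and `n` would then let all pieces of one sequence be fetched in order. Note that a search within single chunks would miss a pattern that straddles a chunk boundary, so the split point matters if content queries are the goal.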

