MongoDB document format consistency

Question

I am relatively new to MongoDB and I have an idea in mind, but I am not sure how to go about it.

What I would like to do is essentially hash a MongoDB document (preferably its JSON form, so it is not database-specific) and store that hash somewhere with a reference to that specific document. This needs to allow me to retrieve the document in the future via a query and compare it against the stored hash.

My idea is to get the JSON representation of the DBObject, hash it, and then add the hash as a field to that specific document before persisting it. Then, when querying for the object, I would exclude the hash field from the result so that the returned DBObject produces the same hash.

1 - Does MongoDB always return a consistent DBObject format that will always convert to the same JSON, so that the hash would always be the same?

2 - Would such an implementation even be viable? That is, storing the hash with the object itself, essentially changing the object (thus making the hash invalid), but getting around that by not retrieving that field in the response.

3 - If the implementation would not work, what would be the simplest way to store the hash? Another object with a reference to the original document?

score 1 · Accepted Answer · edited May 23 '17 at 12:14

1 - Does MongoDB always return a consistent DBObject format which will always convert to the same JSON so that the hash would always be the same?

No. Mongo does not guarantee field order, so the JSON can differ depending on what kind of updates were done on the document. There is no guarantee that the field order will be consistent, or the same, after an update. If no such order-changing updates were done, the order should be preserved (see "MongoDB update on Field Order"). But when you deserialize the JSON into an object using Jackson or something else, it will deserialize to the same object and should have the same hash.
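To illustrate why field order stops mattering once the document is normalized before hashing, here is a minimal sketch that uses only the Java standard library rather than Jackson or Gson. It assumes the document has already been converted to nested `Map` and `List` values; the class name `CanonicalHash` and its methods are hypothetical, and the string escaping is deliberately simplified.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any MongoDB driver.
final class CanonicalHash {

    // Renders a value as text with map keys sorted at every nesting level.
    static String canonicalize(Object value) {
        if (value instanceof Map) {
            // TreeMap iterates in sorted key order regardless of insertion order.
            Map<String, Object> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sorted.put(String.valueOf(e.getKey()), e.getValue());
            }
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<String, Object> e : sorted.entrySet()) {
                if (!first) sb.append(',');
                first = false;
                sb.append(quote(e.getKey())).append(':').append(canonicalize(e.getValue()));
            }
            return sb.append('}').toString();
        }
        if (value instanceof List) {
            // Array order is part of the data, so it is kept as-is.
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (Object item : (List<?>) value) {
                if (!first) sb.append(',');
                first = false;
                sb.append(canonicalize(item));
            }
            return sb.append(']').toString();
        }
        if (value instanceof String) return quote((String) value);
        return String.valueOf(value); // numbers, booleans, null
    }

    // Simplified escaping: handles backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // SHA-256 of the canonical text, as lowercase hex.
    static String hashDocument(Map<String, Object> doc) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(canonicalize(doc).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Two maps holding the same fields in a different insertion order give the same `hashDocument` result. Note that numeric types still matter in this sketch: an `Integer` 1 and a `Double` 1.0 render differently, so the types read back from the database need to match the ones that were hashed.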

2 - Would such an implementation even be viable? As in storing the hash with the object itself, essentially changing the object (thus making the hash invalid) but getting around by not retrieving that field in the response.

Judging from this answer, you can use Jackson or Gson to hash the JSON object even though it is not ordered. Excluding a field should not be a problem. If you store the hash as a field in the object itself, every write query that saves (a save overwrites the entire document) will have to write the hash into it. If any of them fails to do so, the hash will be lost. An update query has another problem: along with changing the data, it also has to update the hash of the document. So it will have to read the object, modify it, compute the hash, and store it back. You will not be able to use the primitive update queries.
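The store-then-exclude idea can be sketched as follows. This is only an illustration under stated assumptions: the document is a flat `Map` (no nested objects), the field name `hash` and the class `HashedDocument` are hypothetical, and the sorted `TreeMap` text form stands in for a proper JSON serializer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any MongoDB driver.
final class HashedDocument {
    static final String HASH_FIELD = "hash"; // assumed field name

    // Hash of every field except HASH_FIELD; flat documents only in this sketch.
    static String contentHash(Map<String, Object> doc) {
        Map<String, Object> content = new TreeMap<>(doc); // sorted copy, input untouched
        content.remove(HASH_FIELD);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns a copy of the document with the hash attached, ready to persist.
    static Map<String, Object> seal(Map<String, Object> doc) {
        Map<String, Object> sealed = new HashMap<>(doc);
        sealed.put(HASH_FIELD, contentHash(doc));
        return sealed;
    }

    // True when the stored hash still matches the rest of the document.
    static boolean verify(Map<String, Object> stored) {
        Object expected = stored.get(HASH_FIELD);
        return expected != null && expected.equals(contentHash(stored));
    }
}
```

Because `contentHash` removes the hash field itself, it does not matter whether the query projects that field out or returns it. Every code path that writes a document still has to go through something like `seal`, which is the maintenance cost described above.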

If you make the hash the primary key, which is the _id field, that would mitigate this problem, although you probably need _id for something else.

3 - The simplest way would be to store the _id of the document to be hashed in another collection, with the hash as the _id of each entry in the new collection.

{
    "_id": <hash code of document>,
    "refer": <_id of the document to be hashed>
}
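As a rough illustration of this layout, the sketch below uses two in-memory maps as stand-ins for the two collections. It does not use the MongoDB driver; the class `HashIndex` and its methods are hypothetical, and the hash is passed in as an already computed string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// In-memory stand-in for the two collections; not real driver code.
final class HashIndex {
    // Documents collection: _id -> document
    private final Map<String, Map<String, Object>> documents = new HashMap<>();
    // Hash collection: _id (the hash) -> refer (the _id of the hashed document)
    private final Map<String, String> hashes = new HashMap<>();

    // One logical insert becomes two writes, one per collection.
    void insert(String id, Map<String, Object> doc, String hash) {
        documents.put(id, doc);
        hashes.put(hash, id);
    }

    // Lookup by hash: first read resolves "refer", second read fetches the document.
    Optional<Map<String, Object>> findByHash(String hash) {
        String refer = hashes.get(hash);
        if (refer == null) return Optional.empty();
        return Optional.ofNullable(documents.get(refer));
    }
}
```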

This would involve multiple reads and writes, which will hurt performance, and depending on your use case it

In my opinion, Mongo is a simple database designed to store and retrieve objects. If you need to do something more complicated with it than fast reads and writes, it's probably not fit for the task.

Thank you. Appreciate the input. I should have mentioned that the idea is that there will be no updating whatsoever on documents. They are supposed to be immutable, so some of the comments may not apply. As for the hash, it was mostly an example; I meant some sort of computation using the entire document as the key and yielding a value (like a SHA-256 or even an external HTTP call), not necessarily the typical hash computation. Could you elaborate on why to use the hash as the _id of the second document and not simply as another field in that document? — Alexandre Thenorio, Aug 16 '16 at 22:42
What I am thinking of is to potentially compare the document hash to its stored value, so I would need to look up by the reference in the secondary collection more often than by the hash value (since the hash is what I am after). — Alexandre Thenorio, Aug 16 '16 at 22:51
Makes sense. I guessed that you were not planning to make any updates, but I included those comments just for reference. Making the hash code the _id of the second collection means you can quickly look up the document using its computed hash, since the _id field is automatically indexed by Mongo. If it's another field, you will have to put an index on it, which will make writes slower if you care about them. — Wolf7176, Aug 16 '16 at 23:21
Storing it in the same document collection is a better option if you always want the hash. But since you don't always want the hash, your main query is going to involve a projection, and this makes it slower. — Wolf7176, Aug 16 '16 at 23:31
