How to check for duplication before creating a new document in CouchDB/Cloudant?

Question

Before saving a new object, we want to check whether the database already contains a document with the same fields and values, so that we do not store duplicates.

Note: This question is not about updating documents or about duplicated document IDs, we only check the data to prevent saving a new document with the same data of an existing one.

Preferably we'd like to accomplish this with Mango/Cloudant queries and not rely on views.

The idea so far is:

1) Scan the data that we are trying to save and dynamically create a selector that matches that document's structure. (We can't hardcode the selectors because we have many types of documents.)

2) Query the DB for any documents matching that selector to see whether a document that meets those criteria already exists.
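
The two steps above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and it assumes that comparing the simple root-level fields (strings, numbers, booleans, null) is enough and that fields starting with an underscore are CouchDB metadata to be ignored.

```javascript
// Build a Mango selector from the simple (non-object, non-array) root-level
// fields of a candidate document. Metadata such as _id and _rev is skipped.
function buildDuplicateSelector(doc) {
  const selector = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key.startsWith('_')) continue; // skip _id, _rev, ...
    const isSimple =
      value === null || ['string', 'number', 'boolean'].includes(typeof value);
    if (isSimple) selector[key] = { $eq: value };
  }
  return selector;
}

// Request body for POST /{db}/_find; one match is enough to detect a duplicate.
function buildFindRequest(doc) {
  return { selector: buildDuplicateSelector(doc), limit: 1 };
}

// Example document (made up for illustration).
const candidate = {
  _id: 'abc',
  type: 'customer',
  name: 'Ana',
  age: 30,
  tags: ['x'],                 // ignored: array
  address: { city: 'Porto' },  // ignored: object
};
console.log(JSON.stringify(buildFindRequest(candidate), null, 2));
```

If a document has no simple root-level fields, the resulting selector is empty and is not a meaningful duplicate check, so the caller would need to handle that case separately.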

However I wonder about the performance of this approach since many of the selector fields will not be indexed.

I would also much rather follow best practices than invent something from scratch, but I haven't been able to find any known solutions for this specific scenario.

If you happen to know of any, please share.

Are we talking about big documents? Do you have any document samples that could help us come up with ideas? You could always create a hash from the document content. – Alexis Côté, Jan 30 '18 at 22:24
Some are small, some are large, say up to 30 attributes for the larger ones. But actually we may not need to do a deep comparison. For our use case it might be enough to compare just the properties that are simple fields (not objects or arrays) at the root level of the document. – J. Araujo, Jan 31 '18 at 13:02

Juanjo Rodriguez · Accepted Answer · 2018-01-31T21:17:38.423


Option 1 - Define a meaningful ID for your documents

The ID could be a logical composition of, or a hash computed from, the values that should be unique.

If you want to check if a document ID already exists you can use the HEAD method

HEAD /db/docId

which returns 200 OK if the docId exists in the database, and 404 Not Found otherwise.
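
As a rough sketch, the HEAD check could be wrapped like this in Node.js. The function names and the database URL are assumptions for illustration; the `fetchImpl` parameter is only there so the HTTP call can be swapped out, and it defaults to the global `fetch` available in recent Node versions.

```javascript
// Build the document URL. dbUrl is assumed to look like
// "http://localhost:5984/mydb"; the ID is URL-encoded.
function docUrl(dbUrl, docId) {
  return `${dbUrl.replace(/\/+$/, '')}/${encodeURIComponent(docId)}`;
}

// Resolves to true on 200, false on 404; any other status is treated as an error.
async function docExists(dbUrl, docId, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(docUrl(dbUrl, docId), { method: 'HEAD' });
  if (res.status === 200) return true;
  if (res.status === 404) return false;
  throw new Error(`Unexpected status ${res.status} for HEAD ${docId}`);
}
```

A call such as `await docExists('http://localhost:5984/mydb', someId)` would then tell you whether that ID is already taken.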

If you would like to check whether the new document has the same content as the previously stored one with the same ID, you can use a validate document update function (validate_doc_update), which receives both documents and lets you compare them.

function(newDoc, oldDoc, userCtx, secObj) {
...
}

Option 2 - Use content hash computed outside CouchDB

Before creating or updating a document, compute a hash from the values of the attributes that should be unique.
Store the hash in the document in a new attribute, e.g. "key_hash".
Create a Mango index on the "key_hash" attribute.
Before inserting a new document, compute its hash and use a Mango query to look for existing documents with the same hash value.
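
A minimal sketch of these steps in Node.js might look like the following. The helper names, the index name and the choice of SHA-256 are assumptions, not something prescribed by CouchDB. The fields are hashed in sorted order so that property order in the object does not affect the result, and the hashed values are assumed to be simple (strings, numbers, booleans).

```javascript
const crypto = require('crypto');

// Hash the chosen fields in a fixed (sorted) order. Missing fields count as null.
function computeKeyHash(doc, uniqueFields) {
  const canonical = [...uniqueFields].sort().map((field) => [field, doc[field] ?? null]);
  return crypto.createHash('sha256').update(JSON.stringify(canonical)).digest('hex');
}

// Request body for POST /{db}/_index (the index name is a made-up example).
const indexDefinition = {
  index: { fields: ['key_hash'] },
  name: 'key-hash-index',
  type: 'json',
};

// Request body for POST /{db}/_find to look up an existing document by hash.
function findByHash(hash) {
  return { selector: { key_hash: hash }, limit: 1 };
}

// Example: attach the hash to a new document before inserting it.
const doc = { type: 'customer', name: 'Ana', email: 'ana@example.com' };
const docWithHash = { ...doc, key_hash: computeKeyHash(doc, ['type', 'email']) };
console.log(JSON.stringify(findByHash(docWithHash.key_hash)));
```

If the `_find` response contains any document in its `docs` array, a document with the same unique values is already stored.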

Option 3 - Compute hash in a View

Define a view that emits the computed hash of each document as the key.
- CouchDB's JavaScript support does not include hashing functions, so this could be difficult to implement in a design document.
- Alternatively, write the map function in Erlang, where you have access to Erlang's hashing support.
Before creating a new document, compute its hash and query the view using that hash as the key.
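
Querying such a view for a given hash could look roughly like this. The design document name "dedup" and view name "by_hash" in the example call are hypothetical. Note that view keys are JSON values, so a string key has to be sent JSON-encoded (with quotes) in the query string.

```javascript
// Build the URL for GET /{db}/_design/{ddoc}/_view/{view}?key=...
function viewQueryUrl(dbUrl, designDoc, viewName, hash) {
  const key = encodeURIComponent(JSON.stringify(hash)); // "abc" -> %22abc%22
  return `${dbUrl}/_design/${designDoc}/_view/${viewName}?key=${key}&limit=1`;
}

// A view response contains a "rows" array; any row means the hash is already present.
function hasDuplicate(viewResponse) {
  return viewResponse.rows.length > 0;
}

console.log(viewQueryUrl('http://localhost:5984/mydb', 'dedup', 'by_hash', 'abc'));
```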

edited Jan 31 '18 at 21:17

answered Jan 30 '18 at 20:22


I'm not trying to do any of those things. The ID is different because it's a new document [notice that I specifically say "minus _id field" in the question] and also this is not a patch/update operation, it's a new document that is being inserted that I need to compare to existing documents in the DB. – J. Araujo Jan 30 '18 at 20:44
Is it an option for you to compute a hash based on those attribute values that should be unique and add it to the docs? – Juanjo Rodriguez Jan 30 '18 at 22:23
Maybe, I just have not yet heard of this solution. Can you point me to any links/examples where this was used in the context of avoiding duplicates in a NoSQL database? – J. Araujo Jan 31 '18 at 16:54
I haven't looked yet at how to use Option 3, but I really like Option 2. The only drawback seems to be race conditions: I'm allowing/denying an insert based on a previous state of the DB, which may have changed in the meantime... but maybe that's a limitation of the platform? – J. Araujo Jan 31 '18 at 21:29
Maybe you can run a cleanup process based on a view that counts how many documents with the same hash you have stored. – Juanjo Rodriguez Jan 31 '18 at 22:11
I'm afraid that either way (a request to a view or a query) you're still relying on the state of the DB at a previous moment in time. In contrast, when you use a compound unique index/key in MongoDB or a relational DB, the validation is done at insert time at the DB layer, hence no race condition is possible. But we don't have such a feature in CouchDB. I think this is why they insist you have to use the document ID, since this seems to be the only uniqueness that the DB guarantees. – J. Araujo Feb 01 '18 at 12:21

gadamcox · Answer 2 · 2018-01-31T19:40:15.177


One solution would be to take Juanjo's and Alexis's comment one step further.

Select the keys you wish to keep unique
Put the values in a string and generate a hash
Set the document's _id to that hash
PUT the document on the database.
Check the response for failure.

If another document already exists in the database with the same _id value, the PUT request will fail with 409 Conflict.
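
The steps above can be sketched in Node.js along these lines. The function names and the use of SHA-256 are assumptions for illustration, and `fetchImpl` exists only so the HTTP call can be replaced; it defaults to the global `fetch`.

```javascript
const crypto = require('crypto');

// Derive a deterministic _id from the values of the keys that must be unique.
function hashId(doc, uniqueKeys) {
  const values = [...uniqueKeys].sort().map((key) => [key, doc[key] ?? null]);
  return crypto.createHash('sha256').update(JSON.stringify(values)).digest('hex');
}

// Map the PUT status to an outcome: 201/202 mean stored, 409 means the _id exists.
function interpretPutStatus(status) {
  if (status === 201 || status === 202) return 'created';
  if (status === 409) return 'duplicate';
  throw new Error(`Unexpected status ${status}`);
}

// PUT the document under its hash-based _id and report the outcome.
async function putUnique(dbUrl, doc, uniqueKeys, fetchImpl = globalThis.fetch) {
  const id = hashId(doc, uniqueKeys);
  const res = await fetchImpl(`${dbUrl}/${id}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ ...doc, _id: id }),
  });
  return interpretPutStatus(res.status);
}
```

Because the existence check and the write happen in a single request, there is no window between "check" and "insert" in which another client could store the same document.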

edited Jan 31 '18 at 19:40

answered Jan 31 '18 at 18:16


The check is supposed to run on all fields, not on specific fields, which is why we don't want to use the ID to store the hash. Having said that, storing a hash in another, dedicated field seems to be the most viable option. – J. Araujo Jan 31 '18 at 21:25
