Deduplication / matching in CouchDB?

Question

I have documents in CouchDB. The schema looks like this:

userId
email
personal_blog_url
telephone

I assume two users are actually the same person if any one of the following is identical:

email, or
personal_blog_url, or
telephone

I have created 3 views, which map email/blog_url/telephone to userIds and then group the userIds under the same key, e.g.,

_view/by_email:
----------------------------------
key                   values     
a_email@gmail.com    [123, 345]
b_email@gmail.com    [23, 45, 333]

_view/by_blog_url:
----------------------------------
key                   values     
http://myblog.com    [23, 45]
http://mysite.com/ss [2, 123, 345]

_view/by_telephone:
----------------------------------
key                   values     
232-932-9088          [2, 123]
000-111-9999          [45, 1234]
999-999-0000          [1]

My questions:

How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Is it good practice to do such deduplication in CouchDB at all?
If so, what would be a good way to do deduplication in CouchDB?

P.S. In the final view, suppose that for all dupes we keep only the smallest userId.

Thanks.

score 2 · Accepted Answer · answered Oct 22 '12 at 22:39


Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).

Merge the views into one (emit different fields in one map):

    function (doc) {
      // Skip documents that are missing any of the three fields.
      if (!doc.email || !doc.personal_blog_url || !doc.telephone) return;
      // The numeric prefix keeps the three kinds of key apart in one view.
      emit([1, doc.email], [doc._id]);
      emit([2, doc.personal_blog_url], [doc._id]);
      emit([3, doc.telephone], [doc._id]);
    }

Merge the lists of ids in the reduce function.

When a new doc arrives in the changes feed, you can query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id in the merged list is smaller than the id of the changed doc, update the changed doc's realId field; otherwise update the documents in the list with the changed doc's id.
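The reduce step mentioned above could look roughly like the sketch below. It is not from the original answer: the name `mergeIds` is hypothetical (inside a design document the function would be anonymous), and it assumes every mapped value is an array of ids, as emitted by the map function.

```javascript
// Hypothetical reduce function for the combined view.
// values holds arrays of ids: either the [doc._id] arrays emitted by map,
// or (when rereduce is true) arrays already merged by earlier reduce calls.
// Both cases have the same shape, so one code path handles them.
function mergeIds(keys, values, rereduce) {
  var seen = Object.create(null); // no inherited keys such as "constructor"
  var merged = [];
  values.forEach(function (ids) {
    ids.forEach(function (id) {
      if (!seen[id]) {
        seen[id] = true;
        merged.push(id);
      }
    });
  });
  return merged;
}

// Local usage example with made-up ids.
var mergedExample = mergeIds(null, [["123"], ["345"], ["123"]], false);
```

Note that CouchDB's `reduce_limit` setting may reject a reduce whose output grows with its input, so for large groups it may be safer to query with `reduce=false` and merge the ids on the client.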

I suggest using a separate document to store the { userId, realId } relation.
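As a rough illustration of the changes-feed step, the sketch below turns the rows returned by the multi-key query into { userId, realId } relations. The function name `resolveRealIds` is hypothetical, and it assumes ids can be compared numerically, as in the question's examples. It treats both branches of the step uniformly by pointing every user in the group at the smallest id, and it does not follow existing realId links, so transitive matches would need a further lookup.

```javascript
// rows: the "rows" array from a map-only query of the combined view with
// keys=[[1, email], [2, personal_blog_url], [3, telephone]].
// Assumption: ids are numeric strings, so Number() gives a usable ordering.
function resolveRealIds(changedId, rows) {
  var ids = [Number(changedId)];
  rows.forEach(function (row) {
    var id = Number(row.id);
    if (ids.indexOf(id) === -1) ids.push(id);
  });
  var realId = Math.min.apply(null, ids);
  // One { userId, realId } relation per user in the matched group.
  return ids.map(function (id) {
    return { userId: id, realId: realId };
  });
}

// Usage with sample rows shaped like a CouchDB view response.
var relations = resolveRealIds("345", [
  { id: "123", key: [1, "a_email@gmail.com"], value: ["123"] },
  { id: "2", key: [2, "http://mysite.com/ss"], value: ["2"] },
  { id: "123", key: [2, "http://mysite.com/ss"], value: ["123"] }
]);
```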

answered Oct 22 '12 at 22:39

Marcin Skórzewski


Thanks Marcin. I think your idea works. Everything worked up to the last step, searching with multiple keys. If I set `keys=[[1, "a@b.com"], [2, "http://a.com"], [3,"334-333-2323"]]`, I always get all documents back as the result. Maybe I should ask a new question on Stack Overflow for this? – greeness Oct 23 '12 at 01:26
I am not sure what you mean by "all documents". Without reduce (just map) you should get a JSON response with `"rows": [{"id": "1", "key": [1, "some@email"], "value": "1"}, {"id": "2", "key": [1, "some@email"], "value": "2"}, ...]` for all the documents containing the same email, blog URL or telephone number as your new record. Did you get any documents in which none of these fields match? Note that with just map (without reduce) the documents will not be sorted by user id. – Marcin Skórzewski Oct 23 '12 at 11:38
I assume `?keys=[[1, "a@b.com"], [2, "http://a.com"], [3,"334-333-2323"]]` is a multiple-key query. The result I got contains some documents in which none of these fields match. If I do only a single-key query, the result is correct. Is there something wrong with the multi-key query? BTW, I am using CouchDB 1.0.1. – greeness Oct 23 '12 at 20:57
Yes, `keys` should be the same as multiple `key` queries with the results merged. I do not know why it does not seem to work in your case :( I ran this map and query and could not reproduce the effect on my data in CouchDB 1.2.0... The only thing that comes to mind is a note in the docs (http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options). I did not use reduce, just map. – Marcin Skórzewski Oct 24 '12 at 13:42
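One thing that may be worth trying for the multi-key query discussed in these comments: the view API also accepts the keys as a JSON body of a POST request, which avoids any query-string encoding issues. The sketch below only builds such a request and does not send it. The helper name and the database, design document and view names are placeholders, not taken from the thread, and this is not a confirmed fix for the behaviour seen on 1.0.1.

```javascript
// Hypothetical helper: describes a multi-key view request that carries
// the keys in a POST body instead of the query string.
function buildMultiKeyRequest(baseUrl, db, ddoc, view, keys) {
  return {
    method: "POST",
    url: baseUrl + "/" + db + "/_design/" + ddoc + "/_view/" + view,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys: keys })
  };
}

// Placeholder names; substitute your own database and view.
var request = buildMultiKeyRequest(
  "http://localhost:5984", "users", "dedup", "by_field",
  [[1, "a@b.com"], [2, "http://a.com"], [3, "334-333-2323"]]
);
```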

score 1 · Answer 2 · edited May 23 '17 at 11:48

You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.

Here's one idea.

Instead of creating 3 views, you could create one view (that indexes the data if it exists):

Key                             Values
---                             ------
[userId, 'phone']               777-555-1212
[userId, 'email']               username@example.com
[userId, 'url']                 favorite.url.example.com

I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).

Then, to query, you could do something like:

...startkey=[userId]&endkey=[userId,{}]

That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
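A minimal sketch of the single view and the range query described above, assuming the field names from the question's schema. The names `byUserField` and `userRange` are hypothetical, and the `emit` defined here is only a stand-in so the code can run outside CouchDB.

```javascript
// Stand-in for CouchDB's emit(), only so the sketch runs outside CouchDB.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical map function producing the [userId, field] keys above.
// Only the raw value is emitted, and only for fields that exist.
function byUserField(doc) {
  if (doc.telephone) emit([doc.userId, 'phone'], doc.telephone);
  if (doc.email) emit([doc.userId, 'email'], doc.email);
  if (doc.personal_blog_url) emit([doc.userId, 'url'], doc.personal_blog_url);
}

// Builds the startkey/endkey range for one user. In CouchDB's collation
// {} sorts after strings, so [userId, {}] closes the range.
function userRange(userId) {
  return 'startkey=' + encodeURIComponent(JSON.stringify([userId])) +
         '&endkey=' + encodeURIComponent(JSON.stringify([userId, {}]));
}

// Local usage example with a sample document.
byUserField({ userId: 123, email: 'a_email@gmail.com', telephone: '232-932-9088' });
var query = userRange(123);
```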

Here's a nice example of using arrays as keys on StackOverflow.

You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.

Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.

Thanks. This approach seems to work when the duplicate information belongs to a single user. But what I need is to de-duplicate users (different userIds that share a common email/url/phone). – greeness, Oct 22 '12 at 19:37
