
I have some summation data that was very easy to generate using some relatively simple map/reduce views. But we want to sort the data based on the group-reduced view values (not the keys). It was suggested that we could use couchdb-lucene to do this. But how? It's not clear to me how to use a full text index to quickly rank this sort of data.

What we already have

An oversimplified example view looks something like the following:

by_sender: {
  map: "function(doc) { emit(doc.sender, 1); }",
  reduce: "function(keys, values, rereduce) { return sum(values); }"
}

Which returns results somewhat like the following (when run with group=true):

 {"rows":[
 {"key":"a@example.com","value":2},
 {"key":"aaa@example.com","value":1},
 {"key":"aaap@example.com","value":34},
 {"key":"aabb@example.com","value":1},
 ... thousands or tens of thousands of rows ...
 ]}

What we want

Those are sorted by the key, but I need the data sorted according to the values, like so:

 {"rows":[
 {"key":"xyzzy@example.com","value":847},
 {"key":"adam@example.com","value":345},
 {"key":"karl@example.com","value":99},
 {"key":"aaap@example.com","value":34},
 ... thousands or tens of thousands of rows ...
 ]}

And I need it sorted as quickly as is reasonably possible (e.g. if it only takes <100ms to update the indexes, it shouldn't take 1 minute before the new data is reflected in queries).

More context: what we already tried

The best answer on Sorting CouchDB Views By Value gives four viable options, which we've tried in increasing order of difficulty:

  1. First we sorted the results client side, but that was way too slow.
  2. Next we created a list function which sorts the data. A little faster, but still too slow.
  3. Chained Map-Reduce Views should handle this problem easily.
    • Someone pointed out Cloudant's Chained Map-Reduce Views. They are not in BigCouch but are part of Cloudant's services, which are unfortunately not in our budget at this time.
    • I started an application layer implementation using the _bulk_docs API. It is tricky if you want to keep updates as snappy as possible while avoiding race conditions, etc. I can continue with this approach, but it is not relaxing. :(
  4. The answer suggested using couchdb-lucene. But I'm not nearly familiar enough with full-text search to understand how to get it to do anything more sophisticated than index the document and return a search result. I don't even know where to start.
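For reference, the list function from step 2 looked roughly like what's sketched below (the design-doc wiring is illustrative, not our actual code). The core problem is visible in the shape of it: a list function has to buffer every row in memory before it can sort, which is why it stays too slow at tens of thousands of rows.

```javascript
// Pure sorting helper: order reduced rows by value, descending.
// This is the logic a list function would apply after draining all rows.
function sortByValue(rows) {
  return rows.slice().sort(function (a, b) {
    return b.value - a.value;
  });
}

// A CouchDB list function wrapping that logic would look roughly like this
// (uses CouchDB's built-in getRow()/send(), so it only runs inside CouchDB):
//
// function (head, req) {
//   var rows = [], row;
//   while ((row = getRow())) rows.push(row);
//   send(JSON.stringify({ rows: sortByValue(rows) }));
// }
```

Because the whole result set must be collected and re-sorted on every request, the list function can never beat a view whose rows are already stored in the desired order.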
nicholas a. evans
  • Hi, I'm not sure what you mean by the list view. You could try inserting the group results into a separate database, where you can easily sort them. I guess you could use a cronjob to update it regularly. – ajreal Mar 27 '12 at 16:45
  • Sorry, I meant list *function* not list *view*. I've updated the question. – nicholas a. evans Mar 27 '12 at 17:06
  • I *also* forgot to mention that we need this data to be as close to real-time as is reasonably possible. If it only takes <100ms to absorb the new documents into the various view indexes, we don't want to wait 1 minute for a cronjob to get around to it. I'm also open to the possibility that couchdb is just wrong for this particular task. Another option is to have a daemon monitoring the changes feed and issue a redis `zincrby messages:by_sender 1 $sender` for each new document. – nicholas a. evans Mar 27 '12 at 17:14
  • Are you doing some logging? I'm thinking you could use a separate document to track the count. When inserting a new document, look up the document corresponding to the email address; if it doesn't exist, create a new one, and if it does, just add +1. – ajreal Mar 27 '12 at 17:35
  • Yes, we could absolutely go with the dual entry approach. But if I'm going that way, I'll probably store the counts in a redis zset instead of a couchdb document. :) – nicholas a. evans Mar 27 '12 at 17:55
  • oh, great. I'm not too sure whether redis supports ordering? – ajreal Mar 27 '12 at 17:57
  • @nicholasa.evans Did you find a proper solution to your problem? – Liran Brimer Apr 02 '14 at 11:14
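The redis alternative floated in the comments could be sketched as a small Node.js daemon tailing the `_changes` feed. Everything here is an assumption, not part of the question: the database name (`mail`), the doc shape (`doc.sender`), the use of the node-redis v4 client, and the long-poll loop itself.

```javascript
// Pure helper: turn one _changes row into zincrby arguments, or null
// if the row has no usable document (e.g. a deletion or a design doc).
function zincrbyArgs(change) {
  var doc = change.doc;
  if (!doc || !doc.sender) return null;
  return { key: "messages:by_sender", increment: 1, member: doc.sender };
}

// Sketch of the daemon loop (hypothetical; assumes node-redis v4 and a
// local CouchDB — not executed here):
//
// const { createClient } = require("redis");
// async function run() {
//   const redis = createClient();
//   await redis.connect();
//   let since = 0;
//   for (;;) {
//     const res = await fetch(
//       "http://localhost:5984/mail/_changes" +
//       "?include_docs=true&feed=longpoll&since=" + since);
//     const body = await res.json();
//     for (const change of body.results) {
//       const args = zincrbyArgs(change);
//       if (args) await redis.zIncrBy(args.key, args.increment, args.member);
//     }
//     since = body.last_seq;
//   }
// }
```

With the counts in a redis sorted set, `ZREVRANGE messages:by_sender 0 -1 WITHSCORES` returns them ordered by value with no extra sort step.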

1 Answer


I had a similar problem: I needed to count the votes per article and sort the articles by their number of votes. I solved it using a separate document to track each vote, plus another document that stores the vote count per article. Let's call them: article, vote, score.

I wrote a cron script that updates the score for each article by counting the "not registered" votes. The script calls a view using the _count reduce function, in which only the "not registered" votes are emitted (registered == FALSE). Querying with group=true gives the number of not-registered votes per article; I then update each article's score and mark those votes as "registered".

At that point I have a view that emits the score of each article as the key and the article id as the value, so the articles can be ordered by score. Conflicts are avoided using this technique.
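The two views described above could be sketched as follows. The field names (`type`, `registered`, `article_id`, `score`) are guesses at the answerer's schema, not confirmed details; in a real design doc these functions would be stored as strings.

```javascript
// 1) Count not-yet-registered votes per article. Pair this map with the
//    built-in "_count" reduce and query with group=true.
var unregisteredVotes = function (doc) {
  if (doc.type === "vote" && doc.registered === false) {
    emit(doc.article_id, null);
  }
};

// 2) Emit the score as the key, so CouchDB's native key ordering returns
//    articles sorted by score; query with descending=true for highest first.
var articlesByScore = function (doc) {
  if (doc.type === "score") {
    emit(doc.score, doc.article_id);
  }
};
```

The trick is the same one the question is circling: instead of sorting reduced values at query time, write the aggregated value into a document and index it as a view key, so the sort falls out of the B-tree for free.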

noun