4

Is there a good way to mimic the behavior of SELECT COUNT(DISTINCT field) in CouchDB?

Imagine we have the following document, which records the time at which a user played a certain song:

{
  song_id: "happy birthday",
  user_id: "boris",
  date_played: [2011, 11, 14, 00, 12, 55],
  _id: ...
}

I'd like to know the number of distinct songs ever played by our user "boris". If our user has listened to "happy birthday" 20 times, that song should still contribute just +1 to the overall song count.

In MySQL, I'd simply execute SELECT COUNT(DISTINCT song_id) FROM plays WHERE user_id = "boris", but I'm drawing a blank when it comes to writing this in CouchDB.

Work-Around 1: If I changed my schema and instead stored all the song-plays inside a single user document for "boris" I could then write a map to emit only distinct values. However, if I wanted to build something on the scale of last.fm, my fear is that updates would start taking a very long time as the "boris" document size (number of plays) continued to grow. (There might also be a maximum document size that I would eventually hit).

Work-Around 2: I could also write a map function to return all of the distinct records, which my python script could sum up itself; but again with hundreds of thousands of distinct songs this would become very slow as well.

What other options am I missing?

Rob Crowell
  • 1,447
  • 3
  • 15
  • 25
  • I also considered simply writing this information as a separate document: check to see if (user_id, song_id) already exists, if so update timestamp, otherwise insert new document. But I was hoping I could use CouchDB's incremental map/reduce updates to build this cache for me instead. – Rob Crowell Nov 15 '11 at 07:21

4 Answers4

3

Assuming I've interpreted your question correctly;

map:

function(doc) {
  emit([doc.user_id, doc.song_id], null);
}

reduce:

_count

query:

?startkey=[<userid>]&endkey=[<userid>,{}]&group=true

Sample output:

http://127.0.0.1:5984/foo/_design/a/_view/b?group=true&
startkey=[%22foo%22]&endkey=[%22foo%22,{}]

{"rows":[
  {"key":["foo","bar"],"value":2},
  {"key":["foo","bazbar"],"value":1}
]}
Robert Newson
  • 4,631
  • 20
  • 18
  • 3
    This map/reduce would give you (["boris", "happy birthday"], 20), (["boris", "yesterday"], 14), ..., but if the user has listened to 50,000 different songs then this query will return 50,000 different rows. Basically what I'd like to get back is just the scalar value 50,000. – Rob Crowell Nov 15 '11 at 16:42
3

This answer was provided by Zachary Zolton on the couchdb mailing list:

http://mail-archives.apache.org/mod_mbox/couchdb-user/201111.mbox/%3CCAGnHtbJ-1-YeLWMLivKzWub98HZY7%2BesnPOHU4pEYgWAsxaszA%40mail.gmail.com%3E

Since you've already got a view that'll give you Boris's 50k unique songs, you could use a _list function to return the number of rows.

Something like this should do the trick:

function() {
 var count = 0;
 while(getRow()) count++;
 return JSON.stringify({count: count});
}

If you query this list function, with the same view, key range and group level, it'll just respond with a bit of JSON, such as: {"count":"50612"}

You can read up more here:

Rob Crowell
  • 1,447
  • 3
  • 15
  • 25
0

I have been struggling with exactly the same thing (see http://mail-archives.apache.org/mod_mbox/couchdb-user/201410.mbox/browser)

It just doesn't feel right to get all this output when you just need a scalar value. Even though the list function is a workaround for not getting the full stream of results, this approach feels very odd.

Any alternatives to do that ?

0

In recent versions of CouchDB (>2.2) you can use the _approx_count_distinct reduce function. Your view would be:

map:

function(doc) {
    emit([doc.user_id, doc.song_id], 1);
}

reduce:

_approx_count_distinct

and the query to get the count of song_ids for user "boris" would be:

/db/_design/_myddoc/_view/myview?group_level=1&key=["boris"]
Will Holley
  • 1,745
  • 10
  • 11