
I'm trying to figure out a way to count the number of unique values of a field in a bucket when that field is not the primary key. Couchbase 2.5 provides an N1QL way to do this. Using their beer-sample bucket, you can issue the following command:

select count(distinct style) from beer-sample

which returns a scalar value of 68.

I'm using Couchbase 2.2.0, which technically doesn't have N1QL, and I want to leverage the map/reduce/rereduce functionality if possible, because I have hundreds of millions of records and an ad hoc query would probably take days to run. Is there a way to do this?

For the map function I have the following:

function (doc, meta) {
    if (doc.type == "beer")
        emit(doc.style, doc.style);
}

For the reduce function I have the following:

function (key, values, rereduce) {
    var u = {}, a = [];

    if (rereduce) {
        // On rereduce, values is an array of the arrays returned by
        // earlier calls to this reduce, so flatten while de-duplicating.
        for (var i = 0; i < values.length; i++) {
            for (var j = 0; j < values[i].length; j++) {
                if (u.hasOwnProperty(values[i][j])) {
                    continue;
                }
                a.push(values[i][j]);
                u[values[i][j]] = 1;
            }
        }
        return a;
    } else {
        // On the first pass, values is a flat array of emitted styles.
        for (var i = 0; i < values.length; i++) {
            if (u.hasOwnProperty(values[i])) {
                continue;
            }
            a.push(values[i]);
            u[values[i]] = 1;
        }
        return a;
    }
}

This returns an array of the unique values, but not a scalar count. Is there any way I can get just the scalar count of unique beer styles? Thanks.

Bo.

2 Answers


The solution to this is relatively straightforward (for the view at least).

Firstly, there's no need to emit the beer style as both the key and the value, so your map function would be better as:

function (doc, meta) {
    if (doc.type == "beer") {
        emit(doc.style, null);
    }
}

Next, simply use the built-in _count reduce function.

By default, this will simply output the count of all documents emitted by the view. However, if you query the view with the filter parameter group set to true (or group_level set to 1; the exact method will vary according to your client SDK), it will instead return an array of objects similar to the following:

{"rows":[
{"key":null,"value":1111},
{"key":"American Rye Ale or Lager","value":11},
{"key":"American-Style Amber/Red Ale","value":219},
{"key":"American-Style Barley Wine Ale","value":32},
{"key":"American-Style Brown Ale","value":187},
{"key":"American-Style Cream Ale or Lager","value":12},
{"key":"American-Style Dark Lager","value":1},
{"key":"American-Style Imperial Stout","value":55},
{"key":"American-Style India Black Ale","value":1},
{"key":"American-Style India Pale Ale","value":230},
{"key":"American-Style Lager","value":370},
{"key":"American-Style Light Lager","value":39},
{"key":"American-Style Pale Ale","value":393},
{"key":"American-Style Stout","value":241},
{"key":"American-Style Strong Pale Ale","value":8}
…
…
]
}

This array can be made smaller with the key filter parameter (with the key being a particular style, or whatever else it is that you wish to count), or, similarly, you can filter the rows client side.
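
For example, with the Couchbase Node.js SDK's ViewQuery API, you could run the grouped query and count the returned rows client side (as also suggested in the comments below). This is only a minimal sketch: the design document name beers and view name by_style are assumptions, and the exact query syntax varies between SDKs and SDK versions.

// A sketch assuming the Couchbase Node.js SDK 2.x view API and a
// hypothetical design document "beers" containing the view "by_style"
// (the map function above plus the built-in _count reduce).
var couchbase = require('couchbase');
var ViewQuery = couchbase.ViewQuery;

var cluster = new couchbase.Cluster('couchbase://localhost');
var bucket = cluster.openBucket('beer-sample');

// group(true) gives one row per distinct key (i.e. per beer style).
var query = ViewQuery.from('beers', 'by_style').group(true);

bucket.query(query, function (err, rows) {
    if (err) throw err;
    // The number of rows is the number of distinct styles.
    console.log('distinct styles:', rows.length);
});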

mrkwse
  • (with `value` equalling the count of documents of that value) – mrkwse Jun 26 '14 at 14:48
  • The _count reduce is highly optimised, also, so even with millions of documents it wouldn't use excessive resources. Queries hit an index, so the only issue would potentially be the number of reads/queries of this index. The index update is incremental, so even when adding/updating docs, unless millions were updated at once, again this should not produce much overhead after the initial index build (see [here](http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#view-performance)). – mrkwse Jun 26 '14 at 15:30
  • To further clarify that 'reads/queries' may be a potential issue, this is meant in terms of the only issue here would be usual database read bottlenecks (Disk I/O and space, bandwidth, and RAM _not_ in use by Couchbase). – mrkwse Jun 27 '14 at 08:31
  • @Mwuk...thanks for the input but this doesn't answer the original question. I need a scalar value count of all the unique styles of beers, not a list with a count of each style. Filtering on a key does not help and can't be done in my situation. Any other thoughts? Thanks. – user3775720 Jul 01 '14 at 14:17
  • @user3775720 After some tinkering, I think the thing to do would be to perform a grouped count as I detailed above, and then count the rows that are returned client side in your SDK (which SDK are you using? I can try to detail this better if you tell me which particular language you're intending to use). – mrkwse Jul 02 '14 at 19:18
  • @Mwuk...for the beer example, your recommendation would work. However for my scenario, where I have 100's of millions of groups this will not work. – user3775720 Jul 03 '14 at 14:18
  • How do you use the built-in _count? What does that mean? I'm just emitting emit(doc.host, null); and using _count in the reduce doesn't do what I expect. – Justin Thomas Aug 27 '15 at 21:28

If the number of distinct groups won't be too big, try building an associative array in the reduce function.

In the beer-sample bucket:

/**
 * Map function
 */
function (doc, meta) {
  if (doc.type == "beer" && doc.style)
    emit(doc.style, null);
}

/**
 * Reduce function
 */
function (keys, values, rereduce) {
  var count_by_key = {};
  if (rereduce) {
    // Merge the per-key counts produced by earlier reduce calls.
    for (var i in values) {
      var _count_by_key = values[i];
      for (var key in _count_by_key) {
        count_by_key[key] = _count_by_key[key] + (count_by_key[key] || 0);
      }
    }
  } else {
    // Count each emitted key.
    if (keys) {
      for (var i in keys) {
        var key = keys[i];
        count_by_key[key] = 1 + (count_by_key[key] || 0);
      }
    }
  }
  return count_by_key;
}

The number of keys in the result value is the scalar count of unique beer styles. It also works with key filters.
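
If you query the view with the reduce enabled and no grouping, it returns a single row whose value is the count_by_key object above, so the scalar count can be read client side. A small JavaScript sketch, assuming the row's value has already been parsed into a variable named value:

// value holds the object returned by the reduce function, e.g.
// { "American-Style Lager": 370, "American-Style Pale Ale": 393, ... }
var distinctStyles = Object.keys(value).length;
console.log('distinct styles:', distinctStyles);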

Bo.