RethinkDB: count unique rows within grouped data

Question

I'm trying to count all unique rows within grouped data, i.e, how many unique rows exist within each group.

Although groupedData.distinct().count() works for relatively small amounts of rows, running it on ~200k rows, such as in my case, ends with "over size limit".

I understand why it happens, yet I can't come up with more efficient way of doing it - is there a way?

score 0 · Answer 1 · answered Nov 18 '15 at 18:46

Count is an expensive thing in RethinkDB to my experience. Especially for count operation that require iterating the whole data set. I myself struggle with this for a bit before.

To my understanding, when you pass groupData to distinct, it creates an array, because groupData will be a sequence, therefore it has 100,000 element limits.

To solve this, I think we have to use a stream, and count the stream instead. We cannot use group because it returns a group of stream, or in other words, an array of stream to my understanding again.

So here is how I solve it:

Create an index on the field I want to groups
Call distnct on that table with the index.
Map the stream, passing value into a count function with getAll, using index

An example query

r.table('t').distinct({index: 'index_name'})
    .map(function(value) {
      return {group: value, total: r.table('t').getAll(value, {index: 'index_name'}).count()}
    })

With this, everything is a stream and we can lazily iterator over result set to get the count of each group.

I get your general direction, and it indeed helps, yet now the problem is distinct()ing the whole table. It takes forever in my case, even when indexed (>2.5 million docs) — Kludge, Nov 19 '15 at 14:15
@Kludge what does your data model look like? Kureikain is right that `count`ing is at the moment an expensive operation, but I'm not sure that the index is set up correctly. — dalanmiller, Jul 26 '16 at 21:23
This is a relatively old question, unfortunately I can't remember :/ btw I'm not sure counting is the expensive part, it's distinct() iiuc — Kludge, Jul 27 '16 at 09:21

RethinkDB: count unique rows within grouped data

1 Answers1