
I'm trying to count all unique rows within grouped data, i.e., how many unique rows exist within each group.

Although groupedData.distinct().count() works for a relatively small number of rows, running it on ~200k rows, as in my case, fails with "over size limit".

I understand why it happens, yet I can't come up with a more efficient way of doing it. Is there one?

Kludge

1 Answer


In my experience, count is an expensive operation in RethinkDB, especially when it requires iterating over the whole data set. I struggled with this for a while myself.

As far as I understand, when you pass groupedData to distinct, it builds an array, because groupedData is a sequence, and arrays are subject to the 100,000-element limit.

To get around this, I think we have to work with a stream and count the stream instead. We cannot use group, because, again as far as I understand, it returns a group of streams, in other words an array of streams.

So here is how I solve it:

  1. Create an index on the field I want to group by.
  2. Call distinct on that table with the index.
  3. Map over the resulting stream, passing each value to getAll on the same index and counting the result.

An example query

r.table('t').distinct({index: 'index_name'})
    .map(function(value) {
      return {group: value, total: r.table('t').getAll(value, {index: 'index_name'}).count()}
    })

With this, everything is a stream, and we can lazily iterate over the result set to get the count of each group.
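
To make the shape of that query easier to see, here is a minimal in-memory sketch in plain JavaScript. It does not talk to RethinkDB at all: buildIndex, distinctKeys and countPerGroup are hypothetical helpers I made up to stand in for the secondary index, distinct({index: ...}) and the map over getAll(...).count(), and the sample rows and the category field are invented for illustration.

```javascript
// In-memory model of the query above; no RethinkDB driver is involved.
// All function names below are hypothetical helpers for illustration only.

// Stand-in for a secondary index: maps each field value to its rows.
function buildIndex(rows, field) {
  const index = new Map();
  for (const row of rows) {
    const key = row[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(row);
  }
  return index;
}

// Stand-in for distinct({index: ...}): yields each index key once, lazily.
function* distinctKeys(index) {
  yield* index.keys();
}

// Stand-in for the map step: one {group, total} object per key,
// produced only when the consumer asks for the next element.
function* countPerGroup(index) {
  for (const key of distinctKeys(index)) {
    yield { group: key, total: index.get(key).length };
  }
}

// Invented sample data.
const rows = [
  { id: 1, category: 'a' },
  { id: 2, category: 'b' },
  { id: 3, category: 'a' },
];
const index = buildIndex(rows, 'category');
const results = [...countPerGroup(index)];
console.log(results);
```

The point of the generators is the same as in the real query: no step materializes the whole set of groups as one array, so the consumer can pull one group at a time. How the server actually evaluates the ReQL version may differ from this model.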

kureikain
  • I get your general direction, and it indeed helps, yet now the problem is distinct()ing the whole table. It takes forever in my case, even when indexed (>2.5 million docs) – Kludge Nov 19 '15 at 14:15
  • @Kludge what does your data model look like? Kureikain is right that `count`ing is at the moment an expensive operation, but I'm not sure that the index is set up correctly. – dalanmiller Jul 26 '16 at 21:23
  • This is a relatively old question, unfortunately I can't remember :/ btw I'm not sure counting is the expensive part, it's distinct() iiuc – Kludge Jul 27 '16 at 09:21