0

I am running some performance benchmarks on RethinkDB (related to a specific use-case). In my simulation, there are 2 tables: contact and event. A contact has many events. The event table has 2 indices: contact_id and compound index on [campaign_id, node_id, event_type]. The contact table has about 500k contacts and about 1.75 million docs in event table.

The query I am struggling with is to find all the contacts who have sent event_type but not open event_type. Following is the query I got to work:

r.table("events"). get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct .set_difference (r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct) .count.run(conn)

But this query uses set difference, not stream difference. I have also tried using difference operator:

r.table("events"). get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'] .difference (r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id']) .count.run(conn)

This query never finishes and the weird thing is even after aborting the query I see (in RethinkDB dashboard) that the reads dont stop.

Whats the most efficient way of doing these kind of queries?

Follow up: find all the male contacts who have sent event_type but not open event_type. What I have now is:

r.table("contacts").get_all(r.args( r.table("events").get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct .set_difference (r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct))) .filter({gender: 1}).count.run(conn)

shardnit
  • 127
  • 1
  • 9

1 Answers1

1

One way to make this efficient is to denormalize your data. Instead of having separate contact and event tables, just have the contact table and make each contact have an array of events. Then you can write:

r.table('contacts').indexCreate('sent_but_not_open', function(row) {
  return row('events').contains('sent').and(
    row('events').contains('open').not());
});

That will work well if the number of events per contact is smallish. If you have thousands or millions of events per contact it will break down though.

RethinkDB doesn't offer a way to diff two streams lazily on the server. The best you could do is to change your compound index to be on [campaign_id, node_id, event_type, contact_id] instead, replace your get_all([1, 5, 'sent'], {index: 'campaign'}) with .between([1, 5, 'sent', r.minval], [1, 5, 'sent', r.maxval], {index: 'campaign'})and then put.distinct({index: 'campaign'})['contact_id']on the end. That will give you a stream of distinct contact IDs rather than an array, and these contact IDs will be sorted. You can then do the same for theopen` events, and diff the two ordered streams in the client by doing a mergesort-like thing.

mlucy
  • 5,249
  • 1
  • 17
  • 21
  • yeah, you guessed it right. A contact can have any number of events (even duplicate events per campaign). So we need a scalable way to collect these contacts. Your suggestion about changing compound index will work for collecting these contacts but thats not going to be fast enough to show the count to the end user. Any advice on how to get the count for such queries, even with some margin of error? – shardnit Sep 10 '15 at 10:06
  • You'd have to write client-side code for that. You can use changefeeds to receive a notification whenever there's a new document inserted, and update a running count that you keep in your client. Unfortunately I think that's the best you can do right now if it needs to be faster than what you can get with compound indexes. – mlucy Sep 13 '15 at 05:32
  • yeah I guess thats the best I can do. Any suggestion on the followup query where there is a predicate on the contact also - find all the male contacts who have `sent` event_type but not `open` event_type? – shardnit Sep 14 '15 at 08:39
  • As long as you're doing an equality comparison, you can just add the field in question to the compound index and it should work. Otherwise you'll have to do a separate filter. – mlucy Sep 14 '15 at 17:13