I am running some performance benchmarks on RethinkDB for a specific use case. My simulation has two tables: contacts and events. A contact has many events. The events table has two indexes: one on contact_id and a compound index on [campaign_id, node_id, event_type] (named campaign in the queries below). The contacts table holds about 500k documents and the events table about 1.75 million.
The query I am struggling with finds all contacts who have an event with event_type sent but none with event_type open. This is the query I got to work:
r.table("events")
  .get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
  .set_difference(
    r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)
  .count.run(conn)
However, this query uses a set difference rather than a stream difference. I have also tried the difference operator:
r.table("events")
  .get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id']
  .difference(
    r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'])
  .count.run(conn)
This query never finishes, and, oddly, even after I abort it the RethinkDB dashboard shows that the reads do not stop.
What is the most efficient way to run this kind of query?
Follow-up: find all male contacts who have an event with event_type sent but none with event_type open. What I have now is:
r.table("contacts").get_all(r.args(
  r.table("events").get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
    .set_difference(
      r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)))
  .filter({gender: 1}).count.run(conn)
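
To make the expected result unambiguous, here is a minimal plain-Ruby sketch of the semantics I am after, run against a few made-up in-memory rows. The sample data and the helper names (contact_ids, sent_not_opened, male_sent_not_opened) are hypothetical and only illustrate the logic; this is not RethinkDB code and says nothing about how the query should be executed efficiently.

```ruby
require 'set'

# Hypothetical sample rows mirroring the shape of the two tables (not real data).
EVENTS = [
  { contact_id: 1, campaign_id: 1, node_id: 5, event_type: 'sent' },
  { contact_id: 1, campaign_id: 1, node_id: 5, event_type: 'open' },
  { contact_id: 2, campaign_id: 1, node_id: 5, event_type: 'sent' },
  { contact_id: 3, campaign_id: 1, node_id: 5, event_type: 'sent' },
  { contact_id: 3, campaign_id: 1, node_id: 5, event_type: 'sent' }, # repeated send
  { contact_id: 4, campaign_id: 2, node_id: 5, event_type: 'sent' }  # different campaign
].freeze

CONTACTS = [
  { id: 1, gender: 1 },
  { id: 2, gender: 1 },
  { id: 3, gender: 2 },
  { id: 4, gender: 1 }
].freeze

# Distinct contact ids that have an event of the given type for one campaign/node.
def contact_ids(events, campaign_id, node_id, event_type)
  events
    .select do |e|
      e[:campaign_id] == campaign_id &&
        e[:node_id] == node_id &&
        e[:event_type] == event_type
    end
    .map { |e| e[:contact_id] }
    .to_set
end

# Contacts with a 'sent' event but no 'open' event.
def sent_not_opened(events, campaign_id, node_id)
  contact_ids(events, campaign_id, node_id, 'sent') -
    contact_ids(events, campaign_id, node_id, 'open')
end

# Follow-up: the same set, restricted to contacts with gender == 1 (male).
def male_sent_not_opened(contacts, events, campaign_id, node_id)
  ids = sent_not_opened(events, campaign_id, node_id)
  contacts.select { |c| ids.include?(c[:id]) && c[:gender] == 1 }
end

puts sent_not_opened(EVENTS, 1, 5).size
puts male_sent_not_opened(CONTACTS, EVENTS, 1, 5).size
```

Each contact is counted once no matter how many sent events it has, which is why my working query applies distinct before taking the difference.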