Executing two identical requests but the DISTINCT keyword gives unexpected results. Without the keyword, the result is ok but with DISTINCT, it looks like the where clause is ignored. Why ?
Cqlsh version:
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Table considered:
DESCRIBE TABLE events;
CREATE TABLE events (
userid uuid,
"timestamp" timestamp,
event_type text,
data text,
PRIMARY KEY (userid, "timestamp", event_type)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
Table content:
SELECT * FROM events;
userid | timestamp | event_type | data
--------------------------------------+--------------------------+------------+------
aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:07:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:08:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:09:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:10:17+0100 | toto | null
(6 rows)
Request1: Request without DISTINCT
SELECT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;
userid
--------------------------------------
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
(3 rows)
Request2: Same request with DISTINCT
SELECT DISTINCT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;
userid
--------------------------------------
aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
(2 rows)
EDIT 1
here is some context.
This table "events" is subject to a lot of writes, it receives around ~1k insertions per second and I have a batch script that checks those events every 5 minutes.
This batch script has 2 needs:
1- get all userids that have been active in the last 5 minutes (i.e every userid present in the events from the last 5 minutes)
2- get all events related to those userids (not only for the last 5 minutes)
I used to have two different tables to handle this. One table "activeusers" for the first request and the "events" table like I have described here for the second request. My problem with that is just that it requires from my server to write in two different tables when it receives an event. So I tried this using only the events table.