Insert query replaces rows having same data field in Cassandra clustering column

Question

I'm learning Cassandra and started off with v3.8. My sample keyspace/table looks like this:

CREATE TABLE digital.usage (
    provider decimal,
    deviceid text,
    date text,
    hours varint,
    app text,
    flat text,
    usage decimal,
    PRIMARY KEY ((provider, deviceid), date, hours)
) WITH CLUSTERING ORDER BY (date ASC, hours ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

I'm using a composite PRIMARY KEY with provider and deviceid as the partition key, so that rows are unique and distributed across the cluster nodes. The clustering keys are date and hours.

I have a few observations:

1) For PRIMARY KEY((provider, deviceid), date, hours), when I insert multiple entries with the same hours value, only the latest is kept and the previous ones disappear.

2) For PRIMARY KEY((provider, deviceid), date), when I insert multiple entries with the same date value, only the latest is kept and the previous ones disappear.

Though I'm happy with the behaviour in point 1, I want to know what is happening in the background. Do I have to understand more about clustering keys?

undefined_variable · Answer 1 · 2017-07-27T07:23:25.150

3

A PRIMARY KEY is meant to be unique.

Most RDBMSs throw an error if you insert a duplicate value into a PRIMARY KEY.

Cassandra does not do a read before write. It creates a new version of the record with the latest timestamp. When you insert data with the same values for all primary key columns, the new data is written with the latest timestamp, and a query (SELECT) returns only the version with the latest timestamp.

Example:

PRIMARY KEY((provider, deviceid), date, hours)
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test', 'test');
-- This will create a new record with, let's say, timestamp 1
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test1', 'test1');
-- This will create a new record with, let's say, timestamp 2

SELECT app, flat FROM digital.usage WHERE provider=1.0 AND deviceid='a' AND date='2017-07-27' AND hours=1;

Will give 
 ------------
| app | flat | 
|-----|------|
|test1|test1 |
 ------------
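
To picture what happens in the background, here is a toy Python model of the behaviour described above. It is only a sketch, not the Cassandra driver and not how storage is really laid out: the names `ToyTable`, `insert` and `select` are made up for illustration, timestamps are passed in by hand, and versions are tracked per row for simplicity (real Cassandra keeps a timestamp per column value).

```python
from collections import defaultdict


class ToyTable:
    """Toy model of last-write-wins; not real Cassandra code."""

    def __init__(self):
        # Every write is simply appended; nothing is read or checked on insert.
        self.versions = defaultdict(list)

    def insert(self, provider, deviceid, date, hours, values, timestamp):
        # The full primary key (partition key + clustering columns) identifies a row.
        key = (provider, deviceid, date, hours)
        self.versions[key].append((timestamp, values))

    def select(self, provider, deviceid, date, hours):
        # On read, only the version with the highest timestamp is returned.
        found = self.versions.get((provider, deviceid, date, hours))
        if not found:
            return None
        return max(found, key=lambda version: version[0])[1]


table = ToyTable()
# Same primary key twice: the second write shadows the first.
table.insert(1.0, "a", "2017-07-27", 1, {"app": "test", "flat": "test"}, timestamp=1)
table.insert(1.0, "a", "2017-07-27", 1, {"app": "test1", "flat": "test1"}, timestamp=2)
# Different hours value: a different primary key, so a separate row.
table.insert(1.0, "a", "2017-07-27", 2, {"app": "other", "flat": "other"}, timestamp=3)

print(table.select(1.0, "a", "2017-07-27", 1))
print(table.select(1.0, "a", "2017-07-27", 2))
```

The first lookup returns the `test1` values, because both writes share the same full key and the later timestamp wins. The second lookup returns its own row, because changing any key column (here `hours`) produces a different key.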

edited Jul 27 '17 at 07:23

answered Jul 27 '17 at 06:29

undefined_variable


Thanks for the quick reply. True, a primary key is meant to be unique. But why only the last field of the primary key? From point 1 in my question, hours are affected. Why not date too? – srikanth Jul 27 '17 at 06:33
Your point 1 and point 2 are one and the same... if you enter multiple entries with the same primary key values, then only the latest will be available – undefined_variable Jul 27 '17 at 07:05
If you enter a different hours value for the same date, then all will be visible... the point being that the combination of `provider, deviceid, date, hours` should not have the same values as any previous record... if it is the same, the value with the latest timestamp wins – undefined_variable Jul 27 '17 at 07:26
1

If you haven't watched the DataStax guide to data modeling, I'd advise you to do so. But the general rules in Cassandra modeling are: 1. Columns that you'll run equality searches on should be in the partition key. 2. Clustering keys should be used for range scans (if needed), and they decide the order within the partition. 3. Ensure that the PK is always going to be unique, so in your case you might just have to add a random `uuid` to ensure uniqueness. – Evaldas Buinauskas Jul 27 '17 at 17:14
how to get older timestamp rows as well? – Jus12 Nov 27 '19 at 10:03
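
On keeping older entries: the `uuid` suggestion in the comment above can be pictured with a small Python model. This is a sketch under assumptions, not Cassandra code: it imagines a hypothetical extra key column (called `entry_id` here) appended to the primary key, so that no two writes ever share a full key and nothing gets shadowed.

```python
import uuid

# Toy stand-in for a table: a dict keyed by the full primary key.
rows = {}


def insert(provider, deviceid, date, hours, values):
    # Hypothetical extra key column: a random UUID makes every key distinct.
    entry_id = uuid.uuid4()
    rows[(provider, deviceid, date, hours, entry_id)] = values
    return entry_id


# Two writes with the same provider, deviceid, date and hours.
insert(1.0, "a", "2017-07-27", 1, {"app": "test"})
insert(1.0, "a", "2017-07-27", 1, {"app": "test1"})

# Both survive, because their full keys differ in entry_id.
history = [v for k, v in rows.items() if k[:4] == (1.0, "a", "2017-07-27", 1)]
print(len(history))  # 2
```

The trade-off in this model is that a lookup by `provider, deviceid, date, hours` now returns several entries instead of one, so the reader has to decide which of them it wants.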
