alllow filtering, data modeling in cql

Question

I'm currently using and researching about data modeling practices in cassandra. So far, I get that you need have a data modeling based on the queries executed. However, multiple select requirements make data modeling even harder or impossible to handle it on 1 table. So, when you can't handle these requirements on 1 table, you need to insert 2-3 tables. In other words, you need to make multiple inserts on 1 operation.

Currently, I'm dealing with a data model of a campaign structure. I have a campaign table on cassandra with the following cql;

CREATE TABLE campaign_users
(
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid),
    INDEX(campaign_id, created_at)
);

In this model, I need to be able to make incremental exports given a timestamp only. In cassandra, there is allow filtering mode that enables select queries for secondary indexes. So, my cql statement for incremental export is the following;

select campaign_id, uid 
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;

However, if allow filtering is used, there is a warning saying that the statement have unpredictable performance. So, is it a good practice relying on allow filtering ? What can be other alternatives ?

Did CQL change or what's up with that syntax? I don't see where `INDEX` inside a `CREATE TABLE` is allowed, nor am I allowed to create the index separately: `CREATE INDEX ON campaign_users(campaign_id, created_at);` — nilskp, May 05 '15 at 21:07

score 14 · Accepted Answer · answered Sep 09 '13 at 13:28

The ALLOW FILTERING warning is because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. You could be scanning through all your data to return zero rows, in the worst case. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the data read through scales linearly with the amount of data returned.

This is OK if you're returning most of the data, so the data skipped over doesn't cost very much. But if you were skipping over most of your data a lot of work will be wasted.

The alternative is to include time in the first component of your primary key, in buckets. E.g. you could have day buckets and duplicate your queries for each day that contains data you need. This method guarantees that most of the data Cassandra reads over is data that you want. The problem is that all data for the bucket (e.g. day) needs to fit in one partition. You can fix this by sharding the partition somehow e.g. include some aspect of the uid within it.

alllow filtering, data modeling in cql

1 Answers1

Linked