7

I have this table:

CREATE TABLE custumer_events_service.events_by_websiteId_time(
    "event_id" text,
    "currentTime" timestamp,
    "websiteId" varchar,

    OTHER COLUMNS ...

    PRIMARY KEY(event_id, websiteId, currentTime)
)

In this case, would I get 10000 rows ordered by currentime when I execute this query:

SELECT * FROM events_by_websiteid_time WHERE websiteid='xxxx' LIMIT 10000 ALLOW FILTERING;

Or did I have to add WITH CLUSTERING ORDER BY (currentTime DESC); at the end?

farhawa
  • 10,120
  • 16
  • 49
  • 91

1 Answers1

8

Cassandra can only enforce a sort order within a partition. As you are using ALLOW FILTERING to avoid having to provide your partition key (event_id) your result set will be ordered by the hashed token values of each event_id, and then by websiteid and currentTime.

To get your results to be ordered by currentTime, you would need to either create a new query table or alter the PRIMARY KEY definition (and perhaps the CLUSTERING ORDER) of your existing table. If you decide to create a new query table, it would have to look something like this:

CREATE TABLE custumer_events_service.events_by_websiteId_time_eventid(
  event_id text,
  currentTime timestamp,
  websiteId varchar,

OTHER COLUMNS ...

  PRIMARY KEY (websiteid,currentTime,event_id))
WITH CLUSTERING ORDER BY (currentTime DESC, event_id ASC);

That would allow this query:

SELECT * FROM events_by_websiteid_time_eventid WHERE websiteid='xxxx' LIMIT 10000;

...to work as you expect.

Aaron
  • 55,518
  • 11
  • 116
  • 132
  • +1 Thanks for your answer Aron (once again ;-) ), what are the downsides of the first solution ( using CLUSTERING ORDER )? wiIth the second solution, I am not risking to disbalance my cluster as I have a non-equivalent distribution of events by websiteId? – farhawa Mar 01 '16 at 19:14
  • @farhawa The downside of the first solution, is that involves duplicating your data into a new query table. But if you need to serve queries for keyed off of both `websiteid` and `event_id`, then you may need both tables...trading disk for performance. As for the imbalance in the 2nd solution, if it is a drastic difference, then maybe it would make sense to add an additional "bucket" to help partition your data like day or month, or something else that makes sense for your use case. – Aaron Mar 01 '16 at 20:16
  • As I am learning cassandra, I have posted a new question http://stackoverflow.com/questions/35880249/cassandra-the-same-query-work-with-cql-but-not-with-python-driver can you take a look? Thanks in advance – farhawa Mar 08 '16 at 23:49
  • @Aaron What do you mean by "a query table"? Is it somehow different from "a table"? – Ivan Balashov Sep 19 '20 at 15:05
  • 1
    @IvanBalashov The term "query table" is used to describe a table which is is designed to hold the same data as an existing table, but with a different PRIMARY KEY definition. This allows it to support querying the same data with a different query. It is a functional term; query tables are technically the same as "normal" tables. – Aaron Sep 19 '20 at 16:42