7

I have the following table defined in Cassandra 2.0.9:

CREATE TABLE history
(
    histid      uuid,
    ddate           text,       -- Day Date, i.e. 2014-11-20
    valtime         timestamp,  -- value time
    val             text,       --value
    PRIMARY KEY ((histid , ddate), valtime )
)
WITH CLUSTERING ORDER BY (valtime desc)
;

Scripts insert several thousand rows into this table daily.

I need to be able to select from this table knowing only the histid. However, i've partitioned the rows using (histid , ddate). Meaning, I have a full day of history values per row.

In order to select from this table for a particular histid, I also need to provide the ddate column. For example:

SELECT * FROM history
WHERE histid= cebc4c80-daa6-11e3-bcc2-005056a975a4
AND ddate = '2014-05-16'
;

To get the most recent value, I can do the following:

SELECT * FROM history
WHERE histid= cebc4c80-daa6-11e3-bcc2-005056a975a4
AND ddate = '2014-05-16'
LIMIT 1
;

However, If i want the most recent value for any given histid, I can't submit the query without knowing what ddate is, since it is part of the partition key.

So...I ask, what would be the best way to approach this?

This is what i've done, but i don;'t know if it's reasonable:

I've created a secondary table:

 CREATE TABLE history_date
(
    histid          uuid,
    maxdate         timestamp, -- most recent date
    PRIMARY KEY (histid)
);

When a row is inserted into the history table, a row is also inserted into this table using, (histid, valtime).

Our program code can then:

1.  query the history_date table for a particular id
2. take the "maxdate" column (truncate it to yyyy-mm-dd)
3. use the histid and truncated maxdate to query the history table to retrieve the most recent value.

So this works. But, it doesn't really feel like a good solution.

Is there a better way to do this, perhaps with just a single table?

Thanks for your time.

Aaron
  • 55,518
  • 11
  • 116
  • 132
iamtheoracle
  • 317
  • 2
  • 11
  • 4
    Data modeling for Cassandra is a different mindset, and the redundant data / query table approach is the correct one to take. – Aaron Jan 27 '15 at 16:17
  • @BryceAtNetwork23 - Thanks. Yes, it certainty is difficult to wrap your head around sometimes. Appreciate the feedback. – iamtheoracle Jan 27 '15 at 16:29
  • 1
    @iamtheoracle It is difficult. The crux is to recognize that a different mindset is needed, and you've passed that hurdle, so you're on the right path. Further than many who never seem to "get it." :) Stick with it, and it will make more sense over time. – Don Branson Jan 27 '15 at 17:52
  • 1
    @DonBranson Thanks 20+ years using the "other" major databases has ruined me :) – iamtheoracle Jan 27 '15 at 21:12
  • @iamtheoracle Heh. Well, I've been coding long enough that I remember the last paradigm shift *to* SQL databases. – Don Branson Jan 27 '15 at 21:55

3 Answers3

1

One thing you could try, is to build a new table partitioned on a wider date range, such as month. This way, you just need to know the month to query.

CREATE TABLE history_by_month(
    histid          uuid,
    ddate           text,       -- Day Date, i.e. 2014-11-20
    valtime         timestamp,  -- value time
    val             text,       --value
    month           text,
    PRIMARY KEY (month, valtime, histid))
WITH CLUSTERING ORDER BY (valtime desc, histid asc);

Now, this query should return what you're looking for:

SELECT * FROM history_by_month
WHERE month = '2014-05'
LIMIT 1;

The only thing to keep in mind, is that if you get too many entries in a single month, you run into the chance of your partitions to be too large. If that becomes an issue, you might consider narrowing that focus to week, perhaps.

Also, anyone still on 2.0.9 should consider upgrading. Even the most recent patch level of 2.1 is much more stable.

Aaron
  • 55,518
  • 11
  • 116
  • 132
  • Having a bigger partition doesn't solve the problem. It's still the same question: how do I figure out the month to use for the query. I just wonder if there is a generic solution except for the one OP has come up with? – Ihor Kaharlichenko Jun 25 '19 at 13:58
  • @IhorKaharlichenko unfortunately, Cassandra has very strict query requirements, and some knowledge of the timeframe is going to be necessary to narrow that down a little. – Aaron Jun 25 '19 at 14:04
0

Well, as you've mentioned, you cannot select with only knowing one value of a table partitioned with two keys. However, clustering order and order by - limit options may help, which you've have alread utilized.

http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html

https://cassandra.apache.org/doc/cql3/CQL.html#selectStmt

safato
  • 118
  • 2
  • 6
0

I think the solution is straight forward, there is no need to complicate things. just make partition key on 'histid' and clustering key on 'ddate'. So your DDL should look like below

CREATE TABLE history ( histid uuid, ddate text, valtime timestamp, val text, PRIMARY KEY ((histid) , ddate, valtime ) ) ;

You can query from any below combination (but ensure the same order in your where clause) a) query on only histid b) query on histid and dddate c) query on histid , dddate and valtime

Let me know if this works for you or you still have question ?