
I am trying to store data with the following schema:

    CREATE TABLE temp_humidity_data (
        asset_id text,
        date text,
        event_time timestamp,
        temperature int,
        humidity int,
        PRIMARY KEY ((asset_id, date), event_time)
    );

I have followed the DataStax article 'Getting Started with Time Series Data Modeling' - http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/

However, with this data model, the one thing that is not working is a query that returns data between two dates. How do I do that?

If I do this:

    SELECT * FROM temp_humidity_data
    WHERE asset_id = '1234' AND date >= '2010-04-02' AND date <= '2011-04-03';

It gives me the following error:

code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"

I understand there is a way to use the IN operator, but I don't want to put all those dates in an IN clause. Is there a way, using the above table definition, to query data between two dates?


2 Answers


The first key in the primary key (a composite in your case) is responsible for scattering the data across different partitions. Each partition is held in its entirety on a single node (and its replicas). Even if the query you request were possible (it would be if you had only the date as the partition key and used a ByteOrderedPartitioner - the default is Murmur3Partitioner), it would effectively do a full scan across your cluster. Think of this as being similar to a full table scan in an RDBMS on a column without an index, only now the full table scan spans multiple machines.

The goal of the composite partition key here is to ensure that no partition gets unmanageably big, but it also takes away your ability to do the range query across dates. If you think your data for an asset can fit in a single partition, you can make the date the first clustering key. That would enable the query. However, all rows for an asset would then be on a single partition. This may be an issue (it's typically good to target around 100 MB as a maximum partition size, though there are exceptions), and hotspots may arise in your cluster (nodes holding partitions for very busy assets will be busy, while other nodes less so).

Another way around this is to maintain manual buckets: add a bucket_id int as part of the partition key [i.e. (asset_id, bucket_id)], have date as the first clustering key, and maintain the bucketing from application code. This distributes the data for an asset across multiple partitions that you control. It needs a bit of calculation, and requires you to query each bucket yourself, but it prevents hotspots and allows your date range queries. You'd obviously only do this if the data for a particular asset is beyond single-partition size, but manageable via buckets.
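A minimal sketch of that bucketed layout (the table name and the month-based bucket scheme are my own illustration, not from the original post):

    -- Hypothetical bucketed variant: the application computes bucket_id,
    -- e.g. bucket_id = 201004 for April 2010, so one partition holds one month.
    CREATE TABLE temp_humidity_data_bucketed (
        asset_id text,
        bucket_id int,
        date text,
        event_time timestamp,
        temperature int,
        humidity int,
        PRIMARY KEY ((asset_id, bucket_id), date, event_time)
    );

    -- Date range query within one bucket; the application issues one such
    -- query per bucket overlapping the range and merges the results.
    SELECT * FROM temp_humidity_data_bucketed
    WHERE asset_id = '1234'
      AND bucket_id = 201004
      AND date >= '2010-04-02' AND date <= '2010-04-30';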

If you absolutely must partition based on date, consider tools like Spark and Shark for efficient post-aggregation.

Hope that helps.

  • If I understand your statement - "If you think your data for an asset can fit in a single partition, you can make the date the first clustering key" - should my primary key be PRIMARY KEY ((date, asset_id), event_time)? Is that right? The part I don't understand is: "all the entries for a given asset will be in a single partition". Will this put a limit on how many data points I can store for an asset forever, or just for a given date? In my setup I am polling the asset every minute, so for a given date I am only going to have about 1,440 samples. So I shouldn't hit the Cassandra limit of 2 billion columns per row. Correct? – Subodh Nijsure Jul 23 '14 at 21:15
  • PRIMARY KEY ((date, asset_id)) won't really help your range query, as the hash would still be composite. PRIMARY KEY (date) can, if you're using a ByteOrderedPartitioner - though that has its own problems (i.e. distribution across nodes). If you're polling every minute, you could use (asset_id, month) with 44,640 entries per partition. That'll allow a range query within each month. In fact, if your entries are small, you can do it per year, with ~527K entries per partition (see the sketch after these comments). – ashic Jul 23 '14 at 21:33
  • Quick back-of-envelope calculation: 1 year = 366 days = 366 * 24 * 60 minutes = 527,040 entries. That gives you just over 198 bytes per entry, assuming a max partition size of 100 MB. Of course, a few hundred meg won't hurt too badly, but having it per year might be convenient. Per month would definitely work as well. – ashic Jul 23 '14 at 21:35
  • Thank you so much for your help. I did some more calculation: even if I poll every second, it would take me 63 years to reach the Cassandra limit of 2 billion columns per asset_id. So I am not going to do this compound-key business. This is my second day working with Cassandra - if I do this, i.e. not make a compound partition key, what trouble will I get into? – Subodh Nijsure Jul 23 '14 at 22:15
  • 2B is the maximum possible column count, but there are other considerations. For background operations, caching, etc., it's recommended to keep partitions small... 100 MB, or maybe a few hundred MB. Performance will suffer, and maintenance operations may take longer. How long will the system "live"? If less than a year or two, I wouldn't bother. If more, then I'd at least compound with the year. For per-minute data, a partition per year seems OK, unless you have specific requirements based on cluster size, hotspot avoidance, etc. – ashic Jul 23 '14 at 23:13
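For reference, a sketch of the per-year layout discussed in these comments (the table name is illustrative, and the year value would be computed by application code):

    -- Hypothetical per-year partitioning: ~527K rows per partition at one
    -- sample per minute, comfortably under recommended partition sizes
    -- if rows are small.
    CREATE TABLE temp_humidity_by_year (
        asset_id text,
        year int,
        event_time timestamp,
        temperature int,
        humidity int,
        PRIMARY KEY ((asset_id, year), event_time)
    );

    -- Time range query within one year's partition:
    SELECT * FROM temp_humidity_by_year
    WHERE asset_id = '1234'
      AND year = 2010
      AND event_time >= '2010-04-02' AND event_time <= '2010-04-03';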

No, only between two times. (asset_id, date) makes different dates live in different partitions. Since this combination of values is hashed together, there is no way for C* to know where all the dates within a particular range live without searching every partition.

If you want to do range scans based on date, then date must be part of the clustering key, not the partition key.
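A minimal sketch of that layout, assuming all data for an asset fits comfortably in one partition (the table name is illustrative):

    -- date moves out of the partition key and becomes the first clustering
    -- key, so the original range query works - at the cost of keeping all
    -- data for an asset in a single partition.
    CREATE TABLE temp_humidity_data_by_asset (
        asset_id text,
        date text,
        event_time timestamp,
        temperature int,
        humidity int,
        PRIMARY KEY (asset_id, date, event_time)
    );

    SELECT * FROM temp_humidity_data_by_asset
    WHERE asset_id = '1234'
      AND date >= '2010-04-02' AND date <= '2011-04-03';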
