I want to collect time-series data and store it in snappydata store. I will be collecting millions of rows of data and I want to make queries across timeslices/ranges.
Here is an example query I want to do:
select avg(value)
from example_timeseries_table
where time >= :startDate and time < :endDate;
So, I am thinking that I want to have PARTITION BY COLUMN on time columns rather than the classic PRIMARY KEY column. In other technologies that I am familiar with like Cassandra DB, using the time columns in the partition key would point me directly at the partition and allow pulling the data for the timeslice in a single node rather than across many distributed nodes.
To be performant, I assume I need to partition by column 'time', in this table.
example_timeseries_table
------------------------
id int not nullable,
value varchar(128) not nullable,
time timestamp not nullable
PERSISTENT ASYNCHRONOUS
PARTITION BY COLUMN time
Is this the correct column to partition on for efficient, time-slice queries or do I need to make even more columns like: year_num, month_num, day_num, hour_num columns and PARTITION BY COLUMN on all of them as well, then do a query like this to focus the query to a particular partitioned node?:
select avg(value)
from example_table
where year_num = 2016
and month_num= 1
and day_num = 4
and hour_num = 11
and time >= :startDate and time < :endDate;