Suppose I have a Cassandra database and need to process a large amount of data that I can retrieve with a single SELECT. The problem is that the processing is too slow, and I'd like to use a distributed system to do the work. How can I reshape the CQL query so that each process gets only a chunk of the data?
I know that I can get a limited number of rows using the LIMIT clause of CQL, but I would need something more like LIMIT plus OFFSET so that each process can fetch an independent chunk of data. (Is OFFSET something that will eventually be implemented in CQL? I've read that it would be inefficient; is that why it hasn't been implemented?)
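To illustrate what I mean, here is the kind of query I am after (the station and date values are made up, and the OFFSET variant is not valid CQL today, it only shows the intent):

-- What CQL offers today: the first N rows of a partition.
SELECT time, value FROM weather_data
  WHERE station = 'ST001' AND date = '2013-06-01'
  LIMIT 1000;

-- What I would like to be able to write (OFFSET is NOT valid CQL, purely illustrative):
-- SELECT time, value FROM weather_data
--   WHERE station = 'ST001' AND date = '2013-06-01'
--   LIMIT 1000 OFFSET 3000;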
I would like to avoid waiting for one query to finish before starting the next, as suggested in "Cassandra pagination: How to use get_slice to query a Cassandra 1.2 database from Python using the cql library", because that approach keeps processes idle while they wait for the previous queries to complete.
As an example, suppose I'd like to process weather data. For the moment, my table looks like this (I could use other data types for storage, such as timeuuid for time; this is just a dummy problem):
CREATE TABLE weather_data (
    station varchar,
    date varchar,
    time varchar,
    value double,
    PRIMARY KEY ( (station, date), time )
);
For a given station and date, I'd like to split the data into chunks (based on time). I can assume that I know how many measurements I have for each station and date.
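For instance, I imagine each process could run its own slice over the `time` clustering column, something like the sketch below (the values and boundaries are made up just to show the idea), but I don't know how to pick the boundaries without knowing how the measurements are distributed:

-- Each worker reads an independent range of the clustering column `time`
-- within the same (station, date) partition:
SELECT time, value FROM weather_data
  WHERE station = 'ST001' AND date = '2013-06-01'
    AND time >= '06:00:00' AND time < '12:00:00';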
If the right answer is "change the structure of the table", I would be glad to see how to modify it.
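One idea I had (the chunk column, the table name, and the write-time hashing are just my own guess, and I don't know whether this is idiomatic) is to add a chunk number to the partition key so that each worker reads its own partition:

-- My own guess at a restructured table; the chunk value would be assigned
-- at write time, e.g. hash(time) % n_chunks.
CREATE TABLE weather_data_chunked (
    station varchar,
    date varchar,
    chunk int,
    time varchar,
    value double,
    PRIMARY KEY ( (station, date, chunk), time )
);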