
Suppose that I have a Cassandra database and I need to process a large amount of data that I can query with a SELECT. The problem is that the processing is too slow and I'd like to use a distributed system to do the work. How can I reshape the CQL query so that each worker gets only a chunk of the data?

I know that I can get a limited number of rows using the LIMIT clause of CQL, but I would need something more like LIMIT and OFFSET so that each process can get an independent chunk of data. (Is OFFSET something that will ultimately be implemented in CQL? I've read that it would be inefficient; is that the reason why it is not implemented?)

I would like to avoid waiting for the end of one query before starting the next, as suggested in Cassandra pagination: How to use get_slice to query a Cassandra 1.2 database from Python using the cql library. That approach would keep processes idle while waiting for previous queries to complete.


As an example, suppose that I'd like to process weather data; for the moment, my table looks like this (I could use other data types for storage, such as timeuuid for the time, this is just a dummy problem):

CREATE TABLE weather_data (
    station varchar,
    date varchar,
    time varchar,
    value double,
    PRIMARY KEY ( (station,date), time )
);

For a given station and date, I'd like to create chunks of data (based on time). I can assume that I know how many measurements I have for each station and date.

If the right answer is "change the structure of the table", I would be glad to see how to modify it.

Dr_Sam

2 Answers


I've changed my answer since I misunderstood the original problem. What I would do is break the information for a given station and date into sub-chunks, for instance by hour of the day, or whatever division is reasonable for you:

CREATE TABLE weather_data (
    station varchar,
    date varchar,
    dayhour int,
    time varchar,
    value double,
    PRIMARY KEY ( (station,date), dayhour, time )
);

In this way you can split your data into 24 chunks and allow the parallel execution I mentioned before: a worker can fetch only the first two hours, for instance (see the sketch below). The downside is that you will always hit the same nodes, since every chunk lives in the same partition.
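Under this schema a worker can fetch its chunk with a simple range query on dayhour. A minimal sketch, assuming the DataStax cassandra-driver; the contact point, keyspace name, and station/date values are made-up placeholders:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])      # assumed contact point
session = cluster.connect('weather')  # assumed keyspace name

# Fetch one chunk: all rows for hours 0 and 1 of this station/date.
# Range restrictions on the first clustering column are allowed once
# the full partition key is fixed.
rows = session.execute(
    "SELECT time, value FROM weather_data "
    "WHERE station = %s AND date = %s AND dayhour >= %s AND dayhour < %s",
    ('station_id', '2014-10-13', 0, 2))
for row in rows:
    print(row.time, row.value)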

An alternative could be to use this primary key:

PRIMARY KEY ( (station,date,dayhour), time )

This one will also partition your data based on dayhour; the side effect is that if you need to get all measurements from a given station on a specific date, you have to perform 24 queries (which you can issue concurrently, as sketched below). Last but not least, you could denormalize: organize the data sorted by hour in a new table and leave the original as is.
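With that alternative key, here is a minimal sketch of the 24 concurrent reads, again assuming the DataStax cassandra-driver; the contact point, keyspace, station/date values, and the handle() helper are all assumptions:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])      # assumed contact point
session = cluster.connect('weather')  # assumed keyspace name

query = ("SELECT time, value FROM weather_data "
         "WHERE station = %s AND date = %s AND dayhour = %s")

# Fire all 24 per-hour queries without waiting for each other...
futures = [session.execute_async(query, ('station_id', '2014-10-13', hour))
           for hour in range(24)]

# ...then collect the results as they complete.
for future in futures:
    for row in future.result():
        handle(row)  # handle() is a placeholder for your processing

Because each (station, date, dayhour) combination is its own partition, these requests can land on different nodes, which is what makes the parallelism pay off.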

HTH, Carlo

Carlo Bertuccini
  • I need to know the exact query you'd like to perform. I understood that you need to read a bunch of data from different stations/days -- is that it? Or do you need to read from the same station/day little by little? – Carlo Bertuccini Oct 13 '14 at 08:29
  • What I would need is, given a station and a date, to read the data little by little (so, split the data according to the intra-day time). Actually, I'd like to perform the same queries as Adam proposed in his answer, but without what I see as bottlenecks. – Dr_Sam Oct 13 '14 at 08:41
  • Thank you for taking the time to answer my question! So you're adding the hour information so that the data are "pre-chunked"... Might work in my case as well. – Dr_Sam Oct 13 '14 at 09:56

You're on the right track using time as the clustering key.

First, I would recommend using a timeuuid for the 'time' clustering column to help avoid collisions.

Independent of that, slice queries within a partition are easily done using LIMIT and '>'. To iterate through samples in your table:

SELECT time, value FROM weather_data WHERE station='station_id' and date='date_here' LIMIT 5;
SELECT time, value FROM weather_data WHERE station='station_id' and date='date_here' AND time > [last time from previous query] LIMIT 5;
...

When you stop receiving data, you've reached the end of this (station, date) partition. (Client drivers have functions to get the minimum timeuuid for a given timestamp, so you can often generalize to use just the second statement.)
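A minimal sketch of that loop with the DataStax Python driver, assuming time is a timeuuid column as recommended above; the contact point, keyspace, station/date values, and the handle() helper are assumptions. min_uuid_from_time in cassandra.util is one such driver function for getting the minimum timeuuid for a timestamp:

from cassandra.cluster import Cluster
from cassandra.util import min_uuid_from_time

cluster = Cluster(['127.0.0.1'])      # assumed contact point
session = cluster.connect('weather')  # assumed keyspace name

# Seed with the smallest timeuuid for the epoch so that even the very
# first iteration can use the "time > ?" form of the statement.
last_time = min_uuid_from_time(0)

while True:
    rows = list(session.execute(
        "SELECT time, value FROM weather_data "
        "WHERE station = %s AND date = %s AND time > %s LIMIT 5",
        ('station_id', '2014-10-13', last_time)))
    if not rows:
        break                  # end of this (station, date) partition
    for row in rows:
        handle(row)            # handle() is a placeholder for your processing
    last_time = rows[-1].time  # resume after the last row we saw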

See also: short discussion and references linked here

Adam Holmberg
  • Sorry for the varchar instead of the timeuuid; in my real case I do indeed use timeuuids. Your solution is essentially the one from the link in my question, which is essentially serial behaviour (making one query after the other). Is there no way to overcome that? – Dr_Sam Oct 13 '14 at 06:07
  • If the main concern is "to avoid waiting for the end of a query to start the next one", I would use the above query structure and parallelize using asynchronous requests. – Adam Holmberg Oct 13 '14 at 17:06