
I would like to use server-side data selection and filtering with the Cassandra Spark connector. We have many sensors that send a value every second, and we are interested in aggregating these data by month, day, hour, etc. I have proposed the following data model:

CREATE TABLE project1(      
      year int,
      month int,
      load_balancer int,
      day int,
      hour int,
      estimation_time timestamp,
      sensor_id int,
      value double, 
      ...
      PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id)
);

We then wanted to aggregate the data for December 2014 with load_balancer IN (0, 1, 2, 3), i.e. across 4 different partitions.

We are using the Cassandra Spark connector version 1.1.1, and we used a combineByKey operation to compute the mean of the values aggregated by hour.
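
Roughly, the read and aggregation look like this (a minimal Scala sketch; the keyspace name `sensors` and the connection host are placeholders, and the mean is computed with reduceByKey here for brevity, whereas our actual job uses combineByKey):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD functions in Spark 1.1

    val conf = new SparkConf()
      .setAppName("hourly-means")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // One read per load_balancer value, i.e. one per Cassandra partition for December 2014.
    val perLoadBalancer = (0 to 3).map { lb =>
      sc.cassandraTable("sensors", "project1")
        .select("day", "hour", "value")
        .where("year = ? AND month = ? AND load_balancer = ?", 2014, 12, lb)
    }

    // Union the four reads and compute the mean value per (day, hour).
    val hourlyMean = sc.union(perLoadBalancer)
      .map(row => ((row.getInt("day"), row.getInt("hour")), (row.getDouble("value"), 1L)))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, count) => sum / count }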

For 4,341,390 tuples, Spark takes 11 minutes to return the result. The issue is that we have 5 nodes, yet Spark uses only one worker to execute the task. Could you please suggest an update to the query or the data model in order to improve the performance?

Razib
Wassim
  • When using sensor_id as the partition key, all the nodes are used (we have around 500 sensors). But in the proposed model I do not know why there are not 4 jobs. Maybe it is because they are on the same node! – Wassim May 04 '15 at 14:24
  • Did you verify that your data is stored on only 1 node? (which sounds weird and unbalanced) Also, can you show your Spark code? – HashtagMarkus May 05 '15 at 13:09
  • @Zerd1984 How can I check whether the data is on one node or not? If it can be checked by tracing a Cassandra query, I think it is distributed: [Cassandra trace](https://drive.google.com/file/d/0B75V1KvBteFOSWpWNm9sR2NUNk0/view?usp=sharing). Concerning the Spark code, I am using this one: [code](http://stackoverflow.com/questions/28806792/spark-combinebykey-java-lambda-expression) – Wassim May 06 '15 at 07:47
  • @Wassim you can use "nodetool getendpoints keyspace table partitionkey" to see where your partitions are stored – HashtagMarkus May 06 '15 at 10:23
  • @Zerd1984 Yes! They are distributed. [LINK](https://docs.google.com/document/d/18cF6jyNvdO3_7PfYQJZKhtfOmJKU0jZuRkbsMiJn4Bc/edit?usp=sharing) – Wassim May 06 '15 at 10:35
  • @Wassim the code you provided does not contain any Cassandra-specific operation...? – HashtagMarkus May 06 '15 at 14:34

1 Answer


The Spark Cassandra Connector has this feature; it is SPARKC-25. You can create an arbitrary RDD of values and then use it as a source of keys to fetch data from a Cassandra table. In other words, you join an arbitrary RDD with a Cassandra RDD. In your case, that arbitrary RDD would contain 4 tuples with the different load_balancer values. Look at the documentation for more info. SCC 1.2 has been released recently and it is probably compatible with Spark 1.1 (it is designed for Spark 1.2, though).
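
A minimal Scala sketch of that idea, assuming the SCC 1.2 joinWithCassandraTable API; the keyspace name `sensors` and the connection host are placeholders, and the column names should be checked against the SCC 1.2 documentation:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("join-with-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // One entry per partition to read: (year, month, load_balancer).
    val keys = sc.parallelize(Seq((2014, 12, 0), (2014, 12, 1), (2014, 12, 2), (2014, 12, 3)))

    // Fetch only the rows belonging to those partition keys from Cassandra.
    val rows = keys
      .joinWithCassandraTable("sensors", "project1")
      .on(SomeColumns("year", "month", "load_balancer"))

    // rows is an RDD of (keyTuple, CassandraRow) pairs; aggregate it as before,
    // e.g. mean of "value" per hour.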

Jacek L.