I would like to use server-side data selection and filtering with the Spark Cassandra connector. We have many sensors that each send a value every second, and we are interested in aggregating this data by month, day, hour, etc.
I have proposed the following data model:
CREATE TABLE project1 (
    year int,
    month int,
    load_balancer int,
    day int,
    hour int,
    estimation_time timestamp,
    sensor_id int,
    value double,
    ...
    PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id)
);
We then wanted to aggregate the data for December 2014 with load_balancer IN (0,1,2,3), so the query touches 4 different partitions.
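To make the partition count concrete, here is a minimal Python sketch that lists the partition keys this selection covers and builds the corresponding CQL restriction. The selected columns are only an illustration (an assumption on my side); the table and key columns are the ones from the model above.

```python
# Partition key columns from the table definition: (year, month, load_balancer).
YEAR, MONTH = 2014, 12
LOAD_BALANCERS = [0, 1, 2, 3]

# One tuple per Cassandra partition touched by the selection.
partition_keys = [(YEAR, MONTH, lb) for lb in LOAD_BALANCERS]

# Equivalent CQL restriction. The selected columns are illustrative only.
cql = (
    "SELECT day, hour, sensor_id, value FROM project1 "
    "WHERE year = {} AND month = {} AND load_balancer IN ({})"
).format(YEAR, MONTH, ", ".join(str(lb) for lb in LOAD_BALANCERS))

print(len(partition_keys), "partitions")
print(cql)
```

All rows for the month therefore live in only four partitions, whatever the number of nodes in the cluster.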
We are using the Spark Cassandra connector version 1.1.1, and we used a combineByKey query to compute the mean of all values aggregated by hour.
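For reference, the aggregation logic is roughly the following. This is a plain-Python sketch, not our actual Spark job: it mimics the three functions a combine-by-key aggregation takes (create a combiner, merge a value, merge two combiners) and keys each value by (day, hour). The function names and the sample values are made up for illustration.

```python
def create_combiner(value):
    # First value seen for a key: start a (sum, count) pair.
    return (value, 1)


def merge_value(acc, value):
    # Fold one more value into an existing (sum, count) pair.
    return (acc[0] + value, acc[1] + 1)


def merge_combiners(a, b):
    # Merge two partial (sum, count) pairs built on different partitions.
    return (a[0] + b[0], a[1] + b[1])


def combine_by_key(partitions):
    """Local stand-in for a combine-by-key over a list of partitions."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)
            else:
                acc[key] = create_combiner(value)
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, comb in acc.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


def hourly_means(partitions):
    # Mean per (day, hour) key = sum / count.
    return {key: s / n for key, (s, n) in combine_by_key(partitions).items()}


# Made-up sample: two partitions of ((day, hour), value) pairs.
sample = [
    [((1, 0), 10.0), ((1, 0), 20.0), ((1, 1), 5.0)],
    [((1, 0), 30.0), ((1, 1), 15.0)],
]
means = hourly_means(sample)
print(means)
```

Carrying (sum, count) pairs instead of partial means keeps the result correct when the partial results from several partitions are merged.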
For 4,341,390 tuples, Spark takes 11 minutes to return the result. The issue is that we have 5 nodes, but Spark uses only one worker to execute the task. Could you please suggest a change to the query or the data model to improve performance?