I am reading data from Cassandra partitions into Spark using the cassandra-connector. I tried the two solutions below for reading the partitions, parallelizing the work by creating as many RDDs as possible, but solution ONE and solution TWO show the same performance.
In solution ONE, the stages appear in the Spark UI immediately. In solution TWO I tried to avoid the inner for loop.
In solution TWO, the stages only appear after a considerable delay, and the delay before the stages show up in the Spark UI grows significantly as the number of userids increases.
Version
Spark - 1.1
DSE - 4.6
cassandra-connector - 1.1
Setup
3 nodes with Spark and Cassandra
Each node has 1 core dedicated to this task.
512 MB RAM for the executor memory.
My Cassandra table schema:
CREATE TABLE test (
    user text,
    userid bigint,
    period timestamp,
    ip text,
    data blob,
    PRIMARY KEY((user, userid, period), ip)
);
First solution:
val users = List("u1","u2","u3")
val period = List("2000-05-01","2000-05-01")
val partitions = users.flatMap(x => period.map(y => (x,y)))
val userids = 1 to 10
for (userid <- userids) {
  val rdds = partitions.map(x => sc.cassandraTable("test_keyspace", "table1")
    .select("data")
    .where("user=?", x._1)
    .where("period=?", x._2)
    .where("userid=?", userid)
  )
  val combinedRdd = sc.union(rdds)
  val result = combinedRdd.map(getDataFromColumns)
    .coalesce(4)
    .reduceByKey((x, y) => x + y)
    .collect()
  result.foreach(println)
}
Second solution:
val users = List("u1","u2","u3")
val period = List("2000-05-01","2000-05-01")
val userids = 1 to 10
val partitions = users.flatMap(x => period.flatMap(
  y => userids.map(z => (x, y, z))))
val rdds = partitions.map(x => sc.cassandraTable("test_keyspace", "table1")
  .select("data")
  .where("user=?", x._1)
  .where("period=?", x._2)
  .where("userid=?", x._3)
)
val combinedRdd = sc.union(rdds)
val result = combinedRdd.map(getDataFromColumns)
  .coalesce(4)
  .reduceByKey((x, y) => x + y)
  .collect()
result.foreach(println)
Why is solution TWO not faster than solution ONE?
My understanding is that, since all the partitions are queried in one stretch and the data is distributed across the nodes, it should be faster. Please correct me if I am wrong.
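For reference, both solutions issue the same single-partition queries in total, but the driver-side shape differs: solution ONE builds and unions a small batch of RDDs per loop iteration, while solution TWO builds one union over the full cartesian product before anything is submitted. Counting them directly (a plain-Scala sketch using the same lists as above, no Spark needed):

```scala
// Same inputs as in the question above.
val users = List("u1", "u2", "u3")
val period = List("2000-05-01", "2000-05-01")
val userids = 1 to 10

// Solution ONE: each loop iteration unions users.size * period.size RDDs.
val rddsPerIteration = users.size * period.size            // 3 * 2 = 6
val iterations = userids.size                              // 10

// Solution TWO: one union over the full cartesian product, built up front.
val rddsInTwo = users.size * period.size * userids.size    // 3 * 2 * 10 = 60

println(s"ONE: $iterations unions of $rddsPerIteration RDDs each")
println(s"TWO: one union of $rddsInTwo RDDs")
```

So solution TWO asks the driver to construct and schedule a single 60-way union DAG up front, and that count grows multiplicatively with the number of userids, which would be consistent with the growing delay before stages appear.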