
I am creating a Spark job server that connects to Cassandra. After getting the records I want to perform a simple group by and sum on them. I am able to retrieve the data, but I could not print the output. I have searched Google for hours and have posted in the Cassandra Google group as well. My current code is below, and I am getting an error at collect.

 override def runJob(sc: SparkContext, config: Config): Any = {
   // sc.cassandraTable("store", "transaction").select("terminalid", "transdate", "storeid", "amountpaid").toArray().foreach(println)
   // Printing of each record is successful
   val rdd = sc.cassandraTable("POSDATA", "transaction").select("terminalid", "transdate", "storeid", "amountpaid")
   val map1 = rdd.map(x => (x.getInt(0), x.getInt(1), x.getDate(2)) -> x.getDouble(3)).reduceByKey((x, y) => x + y)
   println(map1)
   // Output is ShuffledRDD[3] at reduceByKey at Daily.scala:34
   map1.collect
   // map1.collectAsMap().map(println(_))
   // Throwing error: java.lang.ClassNotFoundException: transaction.Daily$$anonfun$2
 }

Nideesh
  • Do you have spark cassandra connector runtime libraries on worker nodes? – noorul May 06 '16 at 12:18
  • It's useful to keep in mind that Spark is lazy: transformations are not applied until you call a final action (like collect, take, foreach, etc.). So println does not force any computation; it just calls toString on the RDD. Therefore you cannot be sure that data was retrieved. – Vitalii Kotliarenko May 06 '16 at 17:53
  • @noorul I have the Cassandra connector driver. The line below prints records: sc.cassandraTable("store", "transaction").select("terminalid","transdate","storeid","amountpaid").toArray().foreach(println) – Nideesh May 07 '16 at 11:47
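As an aside, the (terminalid, transdate, storeid) -> sum(amountpaid) aggregation the question is after can be sanity-checked locally without Spark or Cassandra. Below is a minimal sketch of the same grouping over plain Scala collections; the sample rows are made up for illustration:

```scala
// Local model of the (terminalid, transdate, storeid) -> sum(amountpaid)
// aggregation; the rows below are hypothetical sample data.
val rows = Seq(
  ((1, "2016-05-01", 10), 5.0),
  ((1, "2016-05-01", 10), 7.5),
  ((2, "2016-05-02", 11), 3.0)
)

// groupBy + per-group sum mirrors what reduceByKey((x, y) => x + y) does on an RDD
val totals: Map[(Int, String, Int), Double] =
  rows.groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }

totals.foreach(println)
```

If the logic is right here, the remaining problem on the cluster is only about forcing the computation and where the output goes.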

2 Answers


Your map1 is an RDD. You can try the following:

map1.foreach(r => println(r))
Cecil Pang

Spark does lazy evaluation on RDDs, so try an action:

   map1.take(10).foreach(println)
Knight71
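The laziness both answers point at can also be seen with plain Scala views, as a rough local analogy (no Spark involved): a transformation alone is only recorded, and nothing runs until something forces it, just as an RDD map does nothing until collect or take:

```scala
// Rough local analogy for RDD laziness using a Scala view:
// the map below is only recorded, not executed, until forced.
var evaluated = 0
val lazyMapped = (1 to 3).view.map { n => evaluated += 1; n * 2 }

println(evaluated)              // still 0 - nothing has run yet
val forced = lazyMapped.toList  // forcing, like collect/take on an RDD
println(evaluated)              // now 3
```

This is only an analogy: on a real cluster the closures also run on executor JVMs, which is why driver-side println inside foreach may print on the workers, not in the driver console.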