1

I have a DStream[(String, Int)] with pairs of word counts, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed.
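
Roughly, the setup looks like this (a simplified sketch; lines is an input DStream[String] and the names are illustrative):

// word counts per microbatch, e.g. ("hello", 10)
val stream: DStream[(String, Int)] = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// step index, known only on the driver; incremented once per processed microbatch
var step = 1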

The Cassandra table is created as:

CREATE TABLE wordcounts (
    step int,
    word text,
    count int,
    PRIMARY KEY (step, word)
);

When trying to write the stream to the table...

stream.saveToCassandra("keyspace", "wordcounts", SomeColumns("word", "count"))

... I get java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: step.

How can I prepend the step index to the stream in order to write the three columns together?

I'm using Spark 2.0.0, Scala 2.11.8, Cassandra 3.4.0 and spark-cassandra-connector 2.0.0-M3.

p3zo
  • Since you are trying to save the RDD to an existing table, you need to include all the primary key columns. – Shankar Nov 03 '16 at 03:56
  • How can I include all the primary key columns in the same statement? `a ++ b` works for concatenating lists, but `step ++ stream` fails with a type mismatch. – p3zo Nov 03 '16 at 04:14
  • Since C* is a time-series DB, why not have a timestamp instead of a step index? – Knight71 Nov 03 '16 at 06:09

3 Answers

1

As noted, while the Cassandra table expects something of the form (Int, String, Int), the wordCounts DStream is of type DStream[(String, Int)], so for the call to saveToCassandra(...) to work, we need a DStream of type DStream[(Int, String, Int)].

The tricky part of this question is how to bring a local counter, which by definition is only known in the driver, up to the level of the DStream.

To do that, we need to do two things: "lift" the counter to a distributed level (in Spark terms, an RDD or a DataFrame) and join that value with the existing DStream data.

Starting from the classic Streaming word count example:

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

We add a local var to hold the count of the microbatches:

@transient var batchCount = 0

It's declared transient, so that Spark doesn't try to close over its value when we declare transformations that use it.

Now the tricky bit: within the context of a DStream transformation, we make an RDD out of that single variable and join it with the underlying RDD of the DStream using a cartesian product:

val batchWordCounts = wordCounts.transform{ rdd => 
  batchCount = batchCount + 1

  val localCount = sparkContext.parallelize(Seq(batchCount))
  rdd.cartesian(localCount).map{case ((word, count), batch) => (batch, word, count)}
}

(Note that a simple map function would not work, as only the initial value of the variable would be captured and serialized. Therefore, it would look like the counter never increased when looking at the DStream data.)
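
For illustration, here is a minimal sketch of the naive approach that note refers to (using the same batchCount variable); it would not behave as intended:

// Naive version: batchCount is captured once when the closure is serialized,
// so the records would all appear to carry the same initial counter value.
val brokenWordCounts = wordCounts.map{ case (word, count) =>
  (batchCount, word, count)
}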

Finally, now that the data is in the right shape, save it to Cassandra:

batchWordCounts.saveToCassandra("keyspace", "wordcounts")
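
If you prefer to name the columns explicitly (as in the original saveToCassandra call from the question), a sketch assuming the table defined above:

batchWordCounts.saveToCassandra("keyspace", "wordcounts", SomeColumns("step", "word", "count"))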
maasg
  • This seems really close, but throws `java.io.NotSerializableException: Object of org.apache.spark.streaming.dstream.MappedDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.` – p3zo Nov 03 '16 at 19:04
  • @p3zo I tested this. Could be something in the way you adapted your code? That exception seems to indicate that there's a `DStream` being referenced within an `RDD` transformation: *"This is because the DStream object is being referred to from within the closure."* That's not the case in the code provided. – maasg Nov 03 '16 at 19:18
  • Right - my step index was still a `DStream` from a statement I forgot to remove. The solution you gave works as described. – p3zo Nov 03 '16 at 19:29
0

The updateStateByKey function is provided by Spark for global state handling. For this case it could look something like the following:

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // increment the per-key counter once per batch in which the key appears
    val newCount: Int = runningCount.getOrElse(0) + 1
    Some(newCount)
}

val step = stream.updateStateByKey(updateFunction _)

stream.join(step)
  .map{ case (key, (count, step)) => (step, key, count) }
  .saveToCassandra("keyspace", "wordcounts")
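
Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext; a minimal sketch, assuming ssc is the StreamingContext and the directory is only an example:

// required for stateful operations such as updateStateByKey;
// any reliable directory (e.g. on HDFS) will do
ssc.checkpoint("/tmp/wordcounts-checkpoint")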
Artem Aliev
-1

Since you are trying to save the RDD to an existing Cassandra table, you need to include all of the primary key column values in the RDD.

Alternatively, you can use one of the methods below to save the RDD to a new table:

saveAsCassandraTable or saveAsCassandraTableEx

For more info, see the spark-cassandra-connector documentation.
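
A minimal sketch of that approach, assuming wordCountRdd holds one microbatch's (word, count) pairs and the new table name is hypothetical:

import com.datastax.spark.connector._

case class WordCount(word: String, count: Int)

// saveAsCassandraTable infers the schema from the case class fields and
// creates the (hypothetical) table before writing to it.
wordCountRdd
  .map{ case (word, count) => WordCount(word, count) }
  .saveAsCassandraTable("keyspace", "wordcounts_new")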

Shankar