
I am currently working on a small Spark Streaming job to compute a stock correlation matrix from a DStream.

From a DStream[(time, quote)], I need to aggregate the quotes (Double) by time (Long) across multiple RDDs, before computing the correlations (considering all the quotes of all the RDDs):

dstream.reduceByKeyAndWindow(/* aggregate quotes into Vectors */, Seconds(60))
       .foreachRDD { rdd => Statistics.corr(rdd.values) } // rdd.values: RDD[Vector]
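
To make this concrete, here is a fuller, self-contained sketch of what I have in mind. The 60-second window and the Array-based aggregation are just assumptions on my part, and dstream stands for my input DStream[(Long, Double)]:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.streaming.Seconds

// dstream: DStream[(Long, Double)] of (time, quote) pairs
val aggregated = dstream
  .mapValues(q => Array(q))                             // wrap each quote in a one-element array
  .reduceByKeyAndWindow((a, b) => a ++ b, Seconds(60))  // concatenate all quotes seen for a timestamp

aggregated.foreachRDD { rdd =>
  // Each timestamp becomes one observation row; Statistics.corr
  // expects all rows to have the same length (one quote per stock).
  val rows = rdd.values.map(arr => Vectors.dense(arr))
  if (!rows.isEmpty()) println(Statistics.corr(rows))
}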

To my mind, this could work if the resulting DStream (from reduceByKeyAndWindow) contains only one RDD holding all the aggregated quotes.

But I am not sure. How is the data distributed after reduceByKeyAndWindow? Is there a way to merge the RDDs of a DStream?
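
The only workaround I can think of (untested) is to force each windowed RDD into a single partition before computing, though I realize coalesce only merges partitions within one RDD, not RDDs across batches, which is part of my confusion:

// Untested idea: coalesce each windowed RDD to one partition so that
// all aggregated quotes sit together before corr runs.
aggregated.transform(rdd => rdd.coalesce(1))
          .foreachRDD { rdd => Statistics.corr(rdd.values.map(arr => Vectors.dense(arr))) }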

Bart
