I have a Spark Streaming application that reads data off a Kinesis stream and aggregates metrics gathered from that data. I want to use Breeze to compute descriptive statistics (mean, variance and percentiles) over a distribution of values. Since this is a Spark job, I have a serializable class that stores all of these values for a given "key".
I have made sure that my Breeze version is compatible with the Scala version that Spark uses (2.11, since I'm on Spark 2.0.0). Here are the relevant portions of my pom.xml (where spark.version = 2.0.0 and scala.binary.version = 2.11):
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_${scala.binary.version}</artifactId>
  <version>0.12</version>
</dependency>
And here is the class in which I am using Breeze's meanAndVariance and DescriptiveStats.percentile methods:
import breeze.stats._

@SerialVersionUID(100L)
class ResponseDuration(key: String, distribution: Vector[Double]) extends Serializable {

  // This is the call that fails at runtime (see the error below)
  val durationMeanAndVariance = meanAndVariance(distribution)

  // The percentile calls work fine
  val duration95p = DescriptiveStats.percentile(distribution, 0.95)
  val duration99p = DescriptiveStats.percentile(distribution, 0.99)
  val duration75p = DescriptiveStats.percentile(distribution, 0.75)

  val durationMean = durationMeanAndVariance.mean
  val count = durationMeanAndVariance.count
  val variance = durationMeanAndVariance.variance

  val metricValues = Map(
    s"$key.duration_95p" -> duration95p,
    s"$key.duration_99p" -> duration99p,
    s"$key.duration_75p" -> duration75p,
    s"$key.duration_mean" -> durationMean,
    s"$key.count" -> count
  )
}
The problem I am facing is that when I try to run this Spark job locally I get the following error:
16/09/10 22:54:48 WARN TaskSetManager: Lost task 0.0 in stage 70.0 (TID 107, localhost): java.lang.NoSuchMethodError: breeze.stats.package$.meanAndVariance()Lbreeze/stats/DescriptiveStats$meanAndVariance$;
at com.xxx.www.xxxxx.metrics.ResponseDuration.<init>(ResponseDuration.scala:12)
at com.xxx.www.xxxxx.ServiceMetricsTransformer$$anonfun$8.apply(ServiceMetricsTransformer.scala:86)
at com.xxx.www.xxxxx.ServiceMetricsTransformer$$anonfun$8.apply(ServiceMetricsTransformer.scala:85)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
Finally, this is the place in my Spark application where I create these objects from the extracted data:
// Collect all duration values for a key into one Vector, then build the metrics object
val durationDistributions = durationData.map {
  case (key, values) => (key, Vector(values))
}.reduceByKey(_ ++ _)

val durationMetrics = durationDistributions.map {
  case (key, values) => new ResponseDuration(key, values)
}
The confusing thing is that when I comment out the lines that call meanAndVariance and leave the lines that use percentile, everything works fine; as soon as a call to meanAndVariance is present, I get the error reported above. The variant sketched below runs without any issue.
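This is roughly what that working version looks like (a minimal sketch of my class with only the percentile-based fields kept):

import breeze.stats.DescriptiveStats

@SerialVersionUID(100L)
class ResponseDuration(key: String, distribution: Vector[Double]) extends Serializable {

  // Only the percentile calls remain; no call to breeze.stats.meanAndVariance
  val duration95p = DescriptiveStats.percentile(distribution, 0.95)
  val duration99p = DescriptiveStats.percentile(distribution, 0.99)
  val duration75p = DescriptiveStats.percentile(distribution, 0.75)

  val metricValues = Map(
    s"$key.duration_95p" -> duration95p,
    s"$key.duration_99p" -> duration99p,
    s"$key.duration_75p" -> duration75p
  )
}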
I have run a clean package and force-updated my dependencies to see if that would help, since this most likely seems to be a Scala version incompatibility issue. But if that were the case, how am I able to use percentile at all? I am really confused at this point, maybe because I'm new to both Scala and Spark. Any help is appreciated!
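In case it is useful for diagnosing this, below is a small sketch of something I could try (just an idea, using the standard java.security CodeSource API): it prints which jar the breeze.stats.DescriptiveStats object is actually loaded from at runtime. My understanding is that if this points at a different Breeze jar than the 0.12 one declared in the pom, that could explain the NoSuchMethodError:

import breeze.stats.DescriptiveStats

object BreezeClasspathCheck {
  def main(args: Array[String]): Unit = {
    // Prints the location of the jar that DescriptiveStats was loaded from.
    // A null CodeSource means it came from the bootstrap classloader.
    val source = DescriptiveStats.getClass.getProtectionDomain.getCodeSource
    println(if (source == null) "unknown (bootstrap classloader)" else source.getLocation.toString)
  }
}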