I have a Spark Streaming application that reads data off a Kinesis stream and aggregates metrics gathered from that data. I want to use Breeze to compute descriptive statistics for a distribution of data, such as the mean, variance, and percentiles. Since this is a Spark job, I have a serializable class that stores all of these values for a given "key" value.

I have made sure that my Breeze version is compatible with the Scala version that Spark uses (2.11, since I'm using Spark 2.0.0). Here are the relevant portions of my pom.xml (where spark.version = 2.0.0 and scala.binary.version = 2.11):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.scalanlp</groupId>
        <artifactId>breeze_${scala.binary.version}</artifactId>
        <version>0.12</version>
    </dependency>
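
(For reference, I also know the Maven dependency tree can be filtered down to the Breeze artifact to see which Breeze versions actually end up on the classpath; the coordinates below are just what my pom resolves breeze_${scala.binary.version} to, so adjust as needed.)

    mvn dependency:tree -Dincludes=org.scalanlp:breeze_2.11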

And here is the class in which I am using breeze's DescriptiveStats methods :

    import breeze.stats._

    @SerialVersionUID(100L)
    class ResponseDuration(key: String, distribution: Vector[Double]) extends Serializable {
      // this is the call that triggers the NoSuchMethodError shown below
      val durationMeanAndVariance = meanAndVariance(distribution)
      val duration95p = DescriptiveStats.percentile(distribution, 0.95)
      val duration99p = DescriptiveStats.percentile(distribution, 0.99)
      val duration75p = DescriptiveStats.percentile(distribution, 0.75)
      val durationMean = durationMeanAndVariance.mean
      val count = durationMeanAndVariance.count
      val variance = durationMeanAndVariance.variance
      val metricValues = Map(
        s"$key.duration_95p" -> duration95p,
        s"$key.duration_99p" -> duration99p,
        s"$key.duration_75p" -> duration75p,
        s"$key.duration_mean" -> durationMean,
        s"$key.count" -> count
      )
    }

The problem I am facing is that when I try to run this Spark job locally I get the following error:

    16/09/10 22:54:48 WARN TaskSetManager: Lost task 0.0 in stage 70.0 (TID 107, localhost): java.lang.NoSuchMethodError: breeze.stats.package$.meanAndVariance()Lbreeze/stats/DescriptiveStats$meanAndVariance$;
        at com.xxx.www.xxxxx.metrics.ResponseDuration.<init>(ResponseDuration.scala:12)
        at com.xxx.www.xxxxx.ServiceMetricsTransformer$$anonfun$8.apply(ServiceMetricsTransformer.scala:86)
        at com.xxx.www.xxxxx.ServiceMetricsTransformer$$anonfun$8.apply(ServiceMetricsTransformer.scala:85)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)

Finally, this is the place in my Spark application where I am creating these objects from the extracted data:

    val durationDistributions = durationData.map {
      case (key, values) => (key, Vector(values))
    }.reduceByKey(_ ++ _)

    val durationMetrics = durationDistributions.map {
      case (key, values) => new ResponseDuration(key, values)
    }
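
(To show the shape of the data here, this is a made-up illustration of what the map plus reduceByKey above is meant to produce; the sample keys and durations are invented, and I'm assuming durationData is a pair RDD of (String, Double).)

    // Made-up illustration of the aggregation above (sc is a SparkContext).
    val sampleData = sc.parallelize(Seq(
      ("svcA", 12.0), ("svcA", 7.5), ("svcB", 3.2)
    ))
    val sampleDistributions = sampleData
      .map { case (key, value) => (key, Vector(value)) } // one-element Vector per record
      .reduceByKey(_ ++ _)                               // concatenate into the full per-key distribution
    // sampleDistributions: RDD[(String, Vector[Double])]
    // e.g. ("svcA", Vector(12.0, 7.5)) and ("svcB", Vector(3.2))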

The confusing thing is that when I comment out the lines that call meanAndVariance and leave the lines that use the percentile method, everything works fine. But if I have a call to meanAndVariance, I get the error reported above.
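
(As a sanity check, this is roughly what I mean: the same two Breeze calls on a hard-coded Vector[Double] in a plain Scala program compiled against breeze 0.12, with no Spark involved. The object name and sample values below are made up; the point is only whether the calls link at runtime.)

    import breeze.stats._

    // Standalone check outside Spark; the values are made up.
    object BreezeCheck extends App {
      val distribution = Vector(1.0, 2.0, 3.0, 4.0, 5.0)
      val mv  = meanAndVariance(distribution)                   // fails inside Spark with NoSuchMethodError
      val p95 = DescriptiveStats.percentile(distribution, 0.95) // works fine inside Spark
      println(s"mean=${mv.mean} variance=${mv.variance} count=${mv.count} p95=$p95")
    }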

I have done a clean package and force-updated the dependencies to see if that would help, since this most likely seems to be an issue with Scala version incompatibilities. But if that is the case, how am I able to use percentile? I am really confused at this point, maybe because I'm new to both Scala and Spark. Any help is appreciated!
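
(One more thing I am considering, as a sketch only: printing which jar the Breeze classes are actually loaded from, both on the driver and inside a task. This is plain JVM reflection rather than anything Breeze- or Spark-specific, and it reuses durationDistributions from the snippet above.)

    // Sketch: where is breeze.stats.DescriptiveStats loaded from?
    val driverJar = breeze.stats.DescriptiveStats.getClass
      .getProtectionDomain.getCodeSource.getLocation
    println(s"driver loads Breeze from: $driverJar")

    durationDistributions.foreachPartition { _ =>
      val taskJar = breeze.stats.DescriptiveStats.getClass
        .getProtectionDomain.getCodeSource.getLocation
      // check the executor/driver logs for this line
      println(s"task loads Breeze from: $taskJar")
    }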

  • It looks like an issue of binary incompatibility between your dependencies. I don't know how to do it in Maven, but in SBT you can check for potential binary compatibility problems using the "evicted" command. It will print which dependencies have been replaced by a newer version, and where they have been referenced from. – Haspemulator Sep 11 '16 at 11:24
  • Hey, thanks for your input. I can try to figure that out in Maven, but are you saying it could be an issue with binary compatibility even though I am able to use some methods in the lib but not others? I would have thought that was definitely the issue if I couldn't use any methods in the lib. – yash.vyas Sep 11 '16 at 19:37
  • That's a very real and not so rare problem, and the symptoms are exactly like you describe. – Haspemulator Sep 11 '16 at 19:49
  • Ok, thanks for your help! I will update this once I find the fix for the incompatibility; for now I have a workaround in place. – yash.vyas Sep 17 '16 at 20:27
