
I have a CSV file with three columns (all strings), and I have this code for clustering in Zeppelin.

This is the code:

case class kmeansScore(k: String, score: String, j: String)
val rawData = sc.textFile("/resources/data/v1.csv")
rawData.map(_.split(',').last).countByValue().toSeq.sortBy(_._2).reverse.foreach(println)

import org.apache.spark.mllib.linalg._
val labelsAndData = rawData.zipWithIndex.flatMap {
  case (line,index) =>
    if (index == 0) {
      None
    } else {
      val buffer = line.split(',').toBuffer
      buffer.remove(1, 4)
      val label = buffer.remove(buffer.length-1)
      val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
      Some((label,vector))
    }
}
import org.apache.spark.mllib.clustering._   
def distance(a: Vector, b: Vector) = math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
def distToCentroid(datum: Vector, model: KMeansModel) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  distance(centroid, datum)
}

import org.apache.spark.rdd._
val dataAsArray = labelsAndData.values.map(_.toArray).cache()
dataAsArray.first().length

But I got this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 138, localhost): java.lang.IndexOutOfBoundsException: 1
    at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:158)

What is the problem? I'm working with Zeppelin at https://my.datascientistworkbench.com/tools/zeppelin-notebook/


1 Answer


Learn how to read stack traces. They will tell you where the error is. Without a proper stack trace, we can only speculate.

In particular, at which line of your code does it fail?
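
Because Spark evaluates transformations lazily, the stage number in the trace does not map directly to a source line. One way to narrow it down (a debugging sketch using the RDD names from your question) is to force each step on its own and to check how many fields each raw line actually splits into:

// Force the parsing step by itself: if this count() already fails,
// the problem is inside the flatMap body, not in the later code.
labelsAndData.count()

// How many fields does each line split into? Anything below five
// cannot survive buffer.remove(1, 4).
rawData.map(_.split(',').length).countByValue().foreach(println)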

My blind guess is that the last line is empty, and then the "remove" fails with this exception.
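
As a minimal sketch of that guess: an empty (or otherwise short) line splits into a buffer with fewer than five elements, and remove(1, 4) needs at least index + count elements, so it fails in exactly this way:

val buffer = "".split(',').toBuffer   // an empty line splits into a single empty field
buffer.remove(1, 4)                   // throws the IndexOutOfBoundsException from the trace

Filtering out lines that split into fewer than five fields before calling remove would avoid that failure mode.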

Your code looks fairly inefficient to me. I can only recommend benchmarking it against other tools, including non-Spark tools. I know people love all the "functional" constructs like map and zip, but you will be surprised by how slow this can be when doing numerical work.
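
As a sketch of what a less allocation-heavy version could look like (the name distanceLoop is mine): the same Euclidean distance as in your question, but written as a plain while loop, so no temporary arrays are created by zip and the two maps on every call:

import org.apache.spark.mllib.linalg.Vector

def distanceLoop(a: Vector, b: Vector): Double = {
  val x = a.toArray
  val y = b.toArray
  var sum = 0.0
  var i = 0
  while (i < x.length) {
    val d = x(i) - y(i)   // difference per dimension
    sum += d * d          // accumulate squared differences
    i += 1
  }
  math.sqrt(sum)
}

If your Spark version has it, math.sqrt(Vectors.sqdist(a, b)) computes the same value using MLlib's own optimized code.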
