
I'm using Spark MLlib k-means, which requires all feature vectors to have the same dimension. My features are computed as histograms, so I have to use fixed-size bins. Hive has a built-in function, histogram_numeric(col, b), which computes a histogram of a numeric column in the group using b non-uniformly spaced bins. What is the best way to get b fixed-size bins in the histogram?
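
For context, histogram_numeric can be called from Spark SQL like this (a sketch assuming a HiveContext and a hypothetical table points with a numeric column v):

val hist = sqlContext.sql("SELECT histogram_numeric(v, 10) FROM points")
// histogram_numeric returns an array of (x, y) structs; the bin centers x
// depend on the data, so both the placement and the count of bins can vary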

wdz

1 Answer


One possible way to handle this is to create a UDF like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def get_bucket(breaks: Array[Double]) = udf(
  // binarySearch returns the index of an exact match, or -(insertionPoint) - 1
  // otherwise, so abs(result + 1) maps x into buckets 0 through breaks.length
  (x: Double) =>
     scala.math.abs(java.util.Arrays.binarySearch(breaks, x) + 1))
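
To make the binarySearch trick concrete, here is the same mapping as a plain function, evaluated on a few sample values (bucketOf is a hypothetical helper for illustration, not part of the answer itself):

def bucketOf(breaks: Array[Double], x: Double): Int =
  scala.math.abs(java.util.Arrays.binarySearch(breaks, x) + 1)

bucketOf(Array(0.0, 1.0, 2.0, 3.0, 4.0), -1.0) // 0: below the first break
bucketOf(Array(0.0, 1.0, 2.0, 3.0, 4.0), 2.3)  // 3: between 2.0 and 3.0
bucketOf(Array(0.0, 1.0, 2.0, 3.0, 4.0), 9.0)  // 5: above the last break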

Let's assume your data looks similar to this:

val df = sc.parallelize(Seq(
  (1, 1.0), (1, 2.3), (1, 0.4), (1, 2.1), (1, 3.5), (1, 9.0),
  (2, 3.6), (2, 0.2), (2, 0.6), (2, 0.1), (2, 4.0), (2, -1.0)
)).toDF("k", "v")

where k identifies a point and v is the value you want to use to compute the histogram.

val breaks = Array(0.0, 1.0, 2.0, 3.0, 4.0)

val dfWithBuckets = df
  .withColumn("bucket", get_bucket(breaks)($"v"))
  .groupBy($"k", $"bucket")
  .agg(count(lit(1)))

dfWithBuckets.show()
// +---+------+--------+
// |  k|bucket|count(1)|
// +---+------+--------+
// |  1|     1|       1|
// |  1|     2|       1|
// |  1|     3|       2|
// |  1|     4|       1|
// |  1|     5|       1|
// |  2|     0|       1|
// |  2|     1|       3|
// |  2|     4|       1|
// |  2|     5|       1|
// +---+------+--------+

Finally, the above data can be grouped by key, converted to vectors, and collected. Buckets missing from the output above (for example, bucket 0 for k = 1) are simply zero counts, which is why a sparse vector is a natural fit:

import org.apache.spark.mllib.linalg.Vectors

def toVector(xs: Iterable[(Int, Long)], n: Int) = {
    // sort (bucket, count) pairs by bucket index and build a sparse vector
    val sorted = xs.toArray.sorted
    val indices = sorted.map(_._1)
    val values = sorted.map(_._2.toDouble)
    Vectors.sparse(n, indices, values)
}
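
As a quick sanity check on toVector (using n = 6, i.e. breaks.size + 1, to match the code below):

toVector(Seq((1, 1L), (3, 2L)), 6)
// org.apache.spark.mllib.linalg.Vector = (6,[1,3],[1.0,2.0])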

val vectors = dfWithBuckets.map{
  case Row(k: Int, b: Int, cnt: Long) =>
    (k, (b, cnt))}
  .groupByKey
  // breaks.size + 1 possible buckets: 0 (below the first break)
  // through breaks.size (above the last)
  .mapValues(vs => toVector(vs, breaks.size + 1))

vectors.collect
// Array[(Int, org.apache.spark.mllib.linalg.Vector)] = Array(
//   (1,(6,[1,2,3,4,5],[1.0,1.0,2.0,1.0,1.0])),
//   (2,(6,[0,1,4,5],[1.0,3.0,1.0,1.0])))
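
Since every vector now has the same breaks.size + 1 dimensions, the result can be fed straight into MLlib k-means (a sketch; k = 2 and 20 iterations are arbitrary choices, not part of the original answer):

import org.apache.spark.mllib.clustering.KMeans

// KMeans.train expects an RDD[Vector], so drop the keys first
val model = KMeans.train(vectors.values, 2, 20)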
zero323
  • Thanks for your helpful suggestion! I will use your approach if there is no other alternative. In my case I have: `myDfWithBuckets.groupBy($"bucket").agg(callUDF("histogram_numeric", $"col1", lit(n)))`. If n gave a fixed number of bins, I would be done. The issue with histogram_numeric is that with n = 40 I sometimes get back only 12 buckets and sometimes 20; I need a fixed number of buckets. – wdz Nov 18 '15 at 23:10