
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

I'm trying to write my own UDF for standard deviation for Spark 1.5, and was hoping to see the implementation that ships with 1.6. Thanks. If that's not possible, how would I go about writing a UDF (in Scala) that calculates the standard deviation of a column, given its columnName:

def stddev(columnName: String): Column = {}

jojo
  • def stddev(columnName: Column): Column = { sqrt(avg(columnName * columnName) - avg(columnName) * avg(columnName)) } This is what I came up with. Can anyone confirm if this is right? – jojo Jun 20 '16 at 18:51
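
For what it's worth, that formula is the standard population-variance identity Var(X) = E[X^2] - (E[X])^2, so the shape is right. A quick plain-Scala sanity check (my own example, not from the thread) on the same data used in the answer below:

// Sanity check of Var(X) = E[X^2] - (E[X])^2 on a plain Scala collection
val xs = Seq(1.0, 2.0, 3.0, 4.0)
val mean = xs.sum / xs.size
val meanOfSquares = xs.map(x => x * x).sum / xs.size
math.sqrt(meanOfSquares - mean * mean) // 1.118033988749895, the population stddev

Note this is the population standard deviation (divide by n), not the sample standard deviation (divide by n - 1).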

1 Answer


You can calculate the (population) standard deviation by composing Spark's built-in aggregate column functions inside an aggregation, with no UDF or UDAF needed, like so:

import sqlContext.implicits._ // for toDF and $"..."; pre-imported in spark-shell

val df = sc.parallelize(Seq(1, 2, 3, 4)).toDF("myCol")
df.show

>+-----+
>|myCol|
>+-----+
>|    1|
>|    2|
>|    3|
>|    4|
>+-----+

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, sqrt}

// Population standard deviation: sqrt(E[X^2] - (E[X])^2)
def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))
df.agg(stddev($"myCol")).first

> [1.118033988749895]
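
If you specifically want the String-based signature from the question, a thin wrapper over the Column version should do. A sketch under two assumptions: the name stddevByName is mine (chosen to avoid overload issues in the REPL), and you may also want the sample standard deviation (divide by n - 1, Bessel's correction), which is what Spark 1.6's built-in stddev / stddev_samp return:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, col, count, sqrt}

// Hypothetical helper (name is mine): resolve the column by name,
// then delegate to the Column-based stddev defined above.
def stddevByName(columnName: String): Column = stddev(col(columnName))

// Sample standard deviation: rescale the population variance by n / (n - 1)
// before taking the square root.
def stddevSamp(c: Column): Column =
  sqrt((avg(c * c) - avg(c) * avg(c)) * count(c) / (count(c) - 1))

df.agg(stddevByName("myCol"), stddevSamp($"myCol")).first

> [1.118033988749895,1.2909944487358056]

On 1.6 you can skip all of this: the functions.scala file linked in the question ships stddev, stddev_samp, and stddev_pop as built-in aggregates.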
evan.oman