
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

I'm trying to write my own UDF for standard deviation for Spark 1.5, and was hoping to see the implementation that ships with 1.6. Thanks. If that's not possible, how would I go about writing a UDF (in Scala) that calculates the standard deviation of a column, given its columnName:

def stddev(columnName: String): Column = {}

jojo
  • def stddev(columnName: Column): Column = { sqrt(avg(columnName * columnName) - avg(columnName) * avg(columnName)) } This is what I came up with. Can anyone confirm if this is right? – jojo Jun 20 '16 at 18:51
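
For what it's worth, that formula is the standard population-variance identity Var(X) = E[X^2] - (E[X])^2, so the shape is right. A quick plain-Scala sanity check (my own example, not from the thread) on the same data used in the answer below:

// Sanity check of Var(X) = E[X^2] - (E[X])^2 on a plain Scala collection
val xs = Seq(1.0, 2.0, 3.0, 4.0)
val mean = xs.sum / xs.size
val meanOfSquares = xs.map(x => x * x).sum / xs.size
math.sqrt(meanOfSquares - mean * mean) // 1.118033988749895, the population stddev

Note this is the population standard deviation (divide by n), not the sample standard deviation (divide by n - 1).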

1 Answer


You can calculate the (population) standard deviation by composing Spark's built-in aggregate column functions inside an aggregation, with no UDF or UDAF needed, like so:

import sqlContext.implicits._ // for toDF and $"..."; pre-imported in spark-shell

val df = sc.parallelize(Seq(1, 2, 3, 4)).toDF("myCol")
df.show

>+-----+
>|myCol|
>+-----+
>|    1|
>|    2|
>|    3|
>|    4|
>+-----+

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, sqrt}

// Population standard deviation: sqrt(E[X^2] - (E[X])^2)
def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))
df.agg(stddev($"myCol")).first

> [1.118033988749895]
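
If you specifically want the String-based signature from the question, a thin wrapper over the Column version should do. A sketch under two assumptions: the name stddevByName is mine (chosen to avoid overload issues in the REPL), and you may also want the sample standard deviation (divide by n - 1, Bessel's correction), which is what Spark 1.6's built-in stddev / stddev_samp return:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, col, count, sqrt}

// Hypothetical helper (name is mine): resolve the column by name,
// then delegate to the Column-based stddev defined above.
def stddevByName(columnName: String): Column = stddev(col(columnName))

// Sample standard deviation: rescale the population variance by n / (n - 1)
// before taking the square root.
def stddevSamp(c: Column): Column =
  sqrt((avg(c * c) - avg(c) * avg(c)) * count(c) / (count(c) - 1))

df.agg(stddevByName("myCol"), stddevSamp($"myCol")).first

> [1.118033988749895,1.2909944487358056]

On 1.6 you can skip all of this: the functions.scala file linked in the question ships stddev, stddev_samp, and stddev_pop as built-in aggregates.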
evan.oman