-1

I have a dataframe :

val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}

val cord = DF.select("time","longitude", "latitude","speed")

I want to calculate z score (x-mean)/std of each row of speed column.I calculate the mean and standard deviation :

val std = DF.select(col("speed").cast("double")).as[Double].rdd.stdev()
val mean = DF.select(col("speed").cast("double")).as[Double].rdd.mean()

How to calculate z score for each row of column speed and obtain this result :

+----------------+----------------+-   
|A               |B               |speed    |   z score      
+----------------+----------------+---------------------+
|17/02/2020 00:06|      -7.1732833|   50    |    z score     

|17/02/2020 00:16|      -7.1732833|   40    |    z score   

|17/02/2020 00:26|      -7.1732833|   30    |    z score

How to do for calcule it for each row.

Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
  • Is it make sense to calculate the score for each row with your mean and std? Or are you want to calculate mean and std from the 1st row to the current row for each row? – Lamanus Feb 29 '20 at 22:22
  • I want to calculate z score for each row with the same mean and std like this (50-mean/std) and (40-mean/std) ................ –  Feb 29 '20 at 22:26
  • 1
    I see. Then use `df.withColumn("z test", col("speed") - mean / std)` – Lamanus Feb 29 '20 at 22:29
  • thanks it works for me . –  Feb 29 '20 at 22:33

2 Answers2

1

The best way to perform this is:

df.withColumn("z score", col("speed") - mean / std)

where mean and std are calculated as shown in the question.

Let me know if this helps!!

Anand Sai
  • 1,566
  • 7
  • 11
0

You can avoid two separate RDD actions of you use window-functions and STDDEV_POP from Hive aggregate functions:

val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}

val cord = DF.select($"time",$"longitude", $"latitude",$"speed".cast("double"))

val result = cord
  .withColumn("mean",avg($"speed").over())
  .withColumn("stddev",callUDF("stddev_pop",$"speed").over())
  .withColumn("z-score",($"speed"-$"mean")/$"stddev") 
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145