calculate the z score for each row in the column of a dataframe using scala / spark

Question

I have a dataframe :

val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}

val cord = DF.select("time","longitude", "latitude","speed")

I want to calculate z score (x-mean)/std of each row of speed column.I calculate the mean and standard deviation :

val std = DF.select(col("speed").cast("double")).as[Double].rdd.stdev()
val mean = DF.select(col("speed").cast("double")).as[Double].rdd.mean()

How to calculate z score for each row of column speed and obtain this result :

+----------------+----------------+-   
|A               |B               |speed    |   z score      
+----------------+----------------+---------------------+
|17/02/2020 00:06|      -7.1732833|   50    |    z score     

|17/02/2020 00:16|      -7.1732833|   40    |    z score   

|17/02/2020 00:26|      -7.1732833|   30    |    z score

How to do for calcule it for each row.

Is it make sense to calculate the score for each row with your mean and std? Or are you want to calculate mean and std from the 1st row to the current row for each row? — Lamanus, Feb 29 '20 at 22:22
I want to calculate z score for each row with the same mean and std like this (50-mean/std) and (40-mean/std) ................ — , Feb 29 '20 at 22:26
I see. Then use `df.withColumn("z test", col("speed") - mean / std)` — Lamanus, Feb 29 '20 at 22:29

score 1 · Accepted Answer · answered Mar 01 '20 at 12:58

1

The best way to perform this is:

df.withColumn("z score", col("speed") - mean / std)

where mean and std are calculated as shown in the question.

Let me know if this helps!!

answered Mar 01 '20 at 12:58

Anand Sai

1,566
7
11

Raphael Roth · Answer 2 · 2020-03-01T13:46:42.257

You can avoid two separate RDD actions of you use window-functions and STDDEV_POP from Hive aggregate functions:

val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}

val cord = DF.select($"time",$"longitude", $"latitude",$"speed".cast("double"))

val result = cord
  .withColumn("mean",avg($"speed").over())
  .withColumn("stddev",callUDF("stddev_pop",$"speed").over())
  .withColumn("z-score",($"speed"-$"mean")/$"stddev")

calculate the z score for each row in the column of a dataframe using scala / spark

2 Answers2