How to calculate z-score on DataFrame API in Apache Spark Structured Streaming?

Question

I'm currently struggling with the following:

The z-score is defined as:

z = (x-u)/sd

(where x is the individual value, u the mean of the window and sd the standard deviation of the window)
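
As a quick illustration of the formula only (not of the streaming job), here is a minimal plain-Scala sketch. It assumes a made-up list of values from a single window and uses the sample standard deviation (dividing by n - 1), which is the convention Spark's stddev follows. The names windowValues, mean, sampleSd and zScores are hypothetical helpers for this example.

```scala
// Hypothetical sample: the values that fell into one window.
val windowValues: Seq[Double] = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

// u: mean of the window.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// sd: sample standard deviation (divides by n - 1).
def sampleSd(xs: Seq[Double]): Double = {
  val u = mean(xs)
  math.sqrt(xs.map(x => (x - u) * (x - u)).sum / (xs.size - 1))
}

// z = (x - u) / sd for every individual x in the window.
def zScores(xs: Seq[Double]): Seq[Double] = {
  val u = mean(xs)
  val sd = sampleSd(xs)
  xs.map(x => (x - u) / sd)
}

zScores(windowValues).foreach(println)
```

Note that the last step needs the aggregates (u, sd) and every original x at the same time.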

I can calculate u and sd on the window, but I don't know how to "carry over" each individual x value to the resulting dataframe in order to calculate the z-score for every value. This is how far I've got:

val df = spark.readStream
    .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")   
    .option("topic", "topic/path")
    .load("tcp://localhost:1883")

val counter = df.groupBy(
    window($"timestamp", "2 seconds"),
      $"value")
    .agg($"value",avg($"value")+stddev($"value"))

val query = counter.writeStream
  .outputMode("complete")
  .format("console")
  .start()

My hope was that $"value" in .agg($"value",avg($"value")+stddev($"value")) would carry over each value from the source data frame to the result, but this is not the case.

Any ideas?

any idea why this question got down-voted? I'm still stuck on the same problem... — Romeo Kienzler, Mar 27 '17 at 21:44

score 0 · Accepted Answer · answered Mar 27 '17 at 22:11

I've found the answer now: it is not possible, because groupBy returns an org.apache.spark.sql.RelationalGroupedDataset object (named GroupedData in Spark 1.x), which only supports further aggregations and therefore (of course) doesn't allow access to the individual values of the grouped rows. This post explains it very nicely.
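
To make the point about grouped data concrete: after a groupBy only aggregates come out, so each x has to be reunited with u and sd some other way. Below is a minimal sketch on a static (non-streaming) DataFrame, where the per-window statistics are joined back onto the rows. The sample rows and the names tagged, stats, u, sd and z are assumptions made for the illustration, and this is not a claim that the same plan is accepted on a streaming source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, stddev, window}

// Local session for the sketch. In spark-shell, skip these two lines:
// `spark` and its implicits are already in scope there.
val spark = SparkSession.builder().master("local[*]").appName("zscore-sketch").getOrCreate()
import spark.implicits._

// Hypothetical static sample standing in for the MQTT stream.
val df = Seq(
  ("2017-03-27 21:00:00.0", 1.0),
  ("2017-03-27 21:00:00.5", 2.0),
  ("2017-03-27 21:00:01.0", 3.0),
  ("2017-03-27 21:00:02.5", 10.0),
  ("2017-03-27 21:00:03.0", 20.0)
).toDF("timestamp", "value")
  .withColumn("timestamp", $"timestamp".cast("timestamp"))

// Tag every row with its 2-second tumbling window, keeping the row itself.
val tagged = df.withColumn("window", window($"timestamp", "2 seconds"))

// Group by the window only (not by value), so u and sd describe the whole window.
val stats = tagged.groupBy($"window")
  .agg(avg($"value").as("u"), stddev($"value").as("sd"))

// Join the per-window statistics back onto the individual rows and compute z.
val zScores = tagged.join(stats, Seq("window"))
  .withColumn("z", ($"value" - $"u") / $"sd")

zScores.orderBy($"timestamp").show(false)
```

Unlike the attempt in the question, the grouping key here is the window alone. Grouping by $"value" as well would put every distinct value in its own group, so the standard deviation would no longer describe the window.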
