
Is there a way in PySpark to use an aggregated average value to filter data from the main stream?

I have this code to calculate the average heartbeat per lap.

from pyspark.sql.functions import avg, session_window

# Read the raw telemetry stream from CSV files
df = spark.readStream.format("csv").schema(schema).option("header", True).load("/content/input")

# This is the part that interests you: average heartbeat per lap,
# grouped into 5-minute session windows with a 10-minute watermark
avg_heartbeat_rate_per_lap = df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        session_window(df.timestamp, "5 minutes"),
        df.lapId) \
    .agg(avg("heartbeat"))

The code above calculates the average heartbeat per lap.

Can I do something like the following to filter the values above the average and save them to a database?

df = df.where(df["heartbeat"] > avg_heartbeat_rate_per_lap.tail(1)["avg(heartbeat)"])

This line of code does not work, but I am looking for a similar solution; a rough sketch of what I am trying to achieve is below.
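For illustration only, here is a minimal sketch of the kind of thing I have in mind, using foreachBatch so that a per-lap average computed in each micro-batch can be joined back onto the raw rows. The JDBC URL, table name, and credentials are placeholders, not part of my actual setup, and this uses only the per-batch average per lap rather than the session-window average from above, so it may not be exactly what I need.

from pyspark.sql.functions import avg, col

# Sketch only: filter each micro-batch against its own per-lap average
# and append the rows above that average to a database via JDBC.
def write_above_avg(batch_df, batch_id):
    # Average heartbeat per lap within this micro-batch
    lap_avg = batch_df.groupBy("lapId").agg(avg("heartbeat").alias("avg_heartbeat"))

    # Keep only the rows whose heartbeat is above their lap's average
    above_avg = (batch_df
                 .join(lap_avg, on="lapId")
                 .where(col("heartbeat") > col("avg_heartbeat")))

    (above_avg.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/telemetry")  # placeholder
        .option("dbtable", "above_avg_heartbeats")                    # placeholder
        .option("user", "user")                                       # placeholder
        .option("password", "password")                               # placeholder
        .mode("append")
        .save())

query = df.writeStream.foreachBatch(write_above_avg).start()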
