I am looking at computing average steps taken my the user for "Last 7 days". The device can push multiple records for the total steps taken along with the timestamp. The indicative schema is like below
id |participant_info.participant_id|value|event_info.started_at|event_info.ended_at|
The high level code I have is as follows:
public Dataset<Row> evaluate(Dataset<Row> input) {
Column windowSpec =
functions.window(functions.col("event_info.ended_at"), "7 days", "1 days");
Column averageColumn = functions.round(functions.avg("value")).as("value");
Column groupedParticipant = functions.col("participant_info.participant_id");
Column groupedDate = functions.col("event_info.ended_at");
input.groupBy(windowSpec, groupedParticipant).count().show(false);
return input
//.withWatermark("event_info.ended_at", this.waterMarkDuration + " minutes")
.groupBy(windowSpec, groupedParticipant)
.agg(averageColumn)
.select(
col("participant_id"),
col("window.start").alias("from"),
col("window.end").alias("till"),
col(this.variableName).alias("value"),
col(this.variableName));
}
As seen above, I am trying to create a group for each sliding window and then aggregate.
Test Data
Let me take a sample input data as follows:
+-----+----------------+-----+------------------------------------------+
|id |participant_info|value|event_info |
+-----+----------------+-----+------------------------------------------+
|step1|{participant-1} |100 |{2023-07-05 05:30:00, 2023-07-05 05:30:00}|
|step1|{participant-1} |1000 |{2023-07-04 05:30:00, 2023-07-04 05:30:00}|
|step1|{participant-1} |700 |{2023-07-04 05:30:00, 2023-07-04 05:30:00}|
+-----+----------------+-----+------------------------------------------+
The output I get is
+--------------+-------------------+-------------------+-----+----------+
|participant_id|from |till |value|step_count|
+--------------+-------------------+-------------------+-----+----------+
|participant-1 |2023-06-29 05:30:00|2023-07-06 05:30:00|600.0|600.0 |
|participant-1 |2023-06-28 05:30:00|2023-07-05 05:30:00|850.0|850.0 |
|participant-1 |2023-07-05 05:30:00|2023-07-12 05:30:00|100.0|100.0 |
|participant-1 |2023-07-03 05:30:00|2023-07-10 05:30:00|600.0|600.0 |
|participant-1 |2023-06-30 05:30:00|2023-07-07 05:30:00|600.0|600.0 |
|participant-1 |2023-07-01 05:30:00|2023-07-08 05:30:00|600.0|600.0 |
|participant-1 |2023-07-02 05:30:00|2023-07-09 05:30:00|600.0|600.0 |
|participant-1 |2023-07-04 05:30:00|2023-07-11 05:30:00|600.0|600.0 |
+--------------+-------------------+-------------------+-----+----------+
I can see the compute the average is doing is the sum(value)/total_records
for each grouped bucket.
However, since these are steps, I am looking at number of days being the denominator while computing. In case of group between 2023-06-29 05:30:00|2023-07-06 05:30:00
even though there are 3 records, its essentially 2 days of records. Either the denominator can be 7 days or 2 days.
Is there a way achieve this with spark structured streaming?
Update 1:
I am currently achieving this by manual method like
public Dataset<Row> evaluate(Dataset<Row> input) {
Column windowSpec =
functions.window(
functions.col("event_info.ended_at"),
this.windowDuration + " minutes",
this.slideDuration + " minutes");
Column sumColumn = functions.sum("value").as("sum_value");
Column countColumn =
functions
.countDistinct(functions.date_trunc("day", functions.col("event_info.ended_at")))
.as("count_days");
Column groupedParticipant = functions.col("participant_info.participant_id");
Column averageColumn = functions.col("sum_value").divide(functions.col("count_days")).as("value");
Dataset<Row> aggregatedData =
input
.withWatermark("event_info.ended_at", this.waterMarkDuration + " minutes")
.groupBy(windowSpec, groupedParticipant)
.agg(sumColumn, countColumn);
aggregatedData.show(false);
Dataset<Row> result = aggregatedData.withColumn(this.variableName, averageColumn)
.select(
functions.col("participant_id"),
functions.col("window.start").alias("from"),
functions.col("window.end").alias("till"),
functions.col(this.variableName).alias("value"),
functions.col(this.variableName));
return result;
}