How to calculate the average for last "n" days in spark structured streaming

Question

I am looking at computing average steps taken my the user for "Last 7 days". The device can push multiple records for the total steps taken along with the timestamp. The indicative schema is like below

id |participant_info.participant_id|value|event_info.started_at|event_info.ended_at|

The high level code I have is as follows:

public Dataset<Row> evaluate(Dataset<Row> input) {

    Column windowSpec =
        functions.window(functions.col("event_info.ended_at"), "7 days", "1 days");

    Column averageColumn = functions.round(functions.avg("value")).as("value");
    Column groupedParticipant = functions.col("participant_info.participant_id");
    Column groupedDate = functions.col("event_info.ended_at");

    input.groupBy(windowSpec, groupedParticipant).count().show(false);

    return input
        //.withWatermark("event_info.ended_at", this.waterMarkDuration + " minutes")
        .groupBy(windowSpec, groupedParticipant)
        .agg(averageColumn)
        .select(
            col("participant_id"),
            col("window.start").alias("from"),
            col("window.end").alias("till"),
            col(this.variableName).alias("value"),
            col(this.variableName));
  }

As seen above, I am trying to create a group for each sliding window and then aggregate.

Test Data

Let me take a sample input data as follows:

+-----+----------------+-----+------------------------------------------+
|id   |participant_info|value|event_info                                |
+-----+----------------+-----+------------------------------------------+
|step1|{participant-1} |100  |{2023-07-05 05:30:00, 2023-07-05 05:30:00}|
|step1|{participant-1} |1000 |{2023-07-04 05:30:00, 2023-07-04 05:30:00}|
|step1|{participant-1} |700  |{2023-07-04 05:30:00, 2023-07-04 05:30:00}|
+-----+----------------+-----+------------------------------------------+

The output I get is

+--------------+-------------------+-------------------+-----+----------+
|participant_id|from               |till               |value|step_count|
+--------------+-------------------+-------------------+-----+----------+
|participant-1 |2023-06-29 05:30:00|2023-07-06 05:30:00|600.0|600.0     |
|participant-1 |2023-06-28 05:30:00|2023-07-05 05:30:00|850.0|850.0     |
|participant-1 |2023-07-05 05:30:00|2023-07-12 05:30:00|100.0|100.0     |
|participant-1 |2023-07-03 05:30:00|2023-07-10 05:30:00|600.0|600.0     |
|participant-1 |2023-06-30 05:30:00|2023-07-07 05:30:00|600.0|600.0     |
|participant-1 |2023-07-01 05:30:00|2023-07-08 05:30:00|600.0|600.0     |
|participant-1 |2023-07-02 05:30:00|2023-07-09 05:30:00|600.0|600.0     |
|participant-1 |2023-07-04 05:30:00|2023-07-11 05:30:00|600.0|600.0     |
+--------------+-------------------+-------------------+-----+----------+

I can see the compute the average is doing is the sum(value)/total_records for each grouped bucket.

However, since these are steps, I am looking at number of days being the denominator while computing. In case of group between 2023-06-29 05:30:00|2023-07-06 05:30:00 even though there are 3 records, its essentially 2 days of records. Either the denominator can be 7 days or 2 days.

Is there a way achieve this with spark structured streaming?

Update 1:

I am currently achieving this by manual method like

public Dataset<Row> evaluate(Dataset<Row> input) {
    Column windowSpec =
        functions.window(
            functions.col("event_info.ended_at"),
            this.windowDuration + " minutes",
            this.slideDuration + " minutes");

    Column sumColumn = functions.sum("value").as("sum_value");
    Column countColumn =
        functions
            .countDistinct(functions.date_trunc("day", functions.col("event_info.ended_at")))
            .as("count_days");
    Column groupedParticipant = functions.col("participant_info.participant_id");
    Column averageColumn = functions.col("sum_value").divide(functions.col("count_days")).as("value");


    Dataset<Row> aggregatedData =
        input
            .withWatermark("event_info.ended_at", this.waterMarkDuration + " minutes")
            .groupBy(windowSpec, groupedParticipant)
            .agg(sumColumn, countColumn);
    aggregatedData.show(false);

    Dataset<Row> result = aggregatedData.withColumn(this.variableName, averageColumn)
        .select(
            functions.col("participant_id"),
            functions.col("window.start").alias("from"),
            functions.col("window.end").alias("till"),
            functions.col(this.variableName).alias("value"),
            functions.col(this.variableName));

    return result;
  }

How to calculate the average for last "n" days in spark structured streaming

Test Data

Update 1:

0 Answers0