I have 2 streams, one Order stream, other for booking all taxi available for booking after order was accepted and other for all taxi booked.
Order Stream :
Order_id : string
timeCreated : Timestamp
region_id : string
Taxi Available Stream :
taxi_ids : list[string]
timeCreated : timestamp
Order_id : string
Application Flow is basically, once the order is placed, system checks if this can be taken up and then we find all the available taxis from which user can choose, which is shown to the user.
the taxi-available stream event is created after order event by next 2 minutes. that is
timeCreated(Order) < timeCreated(taxi-available) < timeCreated(Order) + 2 Minutes
I want to calculate the aggregate for a 10 minute window, where I like to average all the no. of taxi available for a given region_id
My SQL look like this :
SELECT o.region_id, avg(length(ta.taxis))
FROM order o JOIN taxi_available ta
ON o.order_id=ta.order_id
AND
ta.timeCreated BETWEEN o.timeCreated AND o.timeCreated + INTERVAL '2' Minute
GROUP BY TUMBLE(o.timeCreated, INTERVAL '10' Minute), o.region_id
I like to know what is wrong in this approach, Does time-window joins work for aggregates??
My doubt is that
it seems the Tumble(o.timeCreated, INTERVAL '10' Minute) is operating on both streams and making my query calculate average for same 10 minute data on both,
i.e. say order-stream from 2:00 till 2:10 then the above computation should take 2:00 till 2:12 from taxi_available stream, but instead it took from 2:00 till 2:10 for both streams
In which case the events coming in last 2 minutes of order streams (which will find the taxi-available events coming in next 2 minutes) may not find matching row and this average is not actually for 10 minutes orders rather less than 10 minutes then!!,
If that is true ?? what can be done for this ?