0

I have 2 streams, one Order stream, other for booking all taxi available for booking after order was accepted and other for all taxi booked.

Order Stream :

  1. Order_id : string

  2. timeCreated : Timestamp

  3. region_id : string

Taxi Available Stream :

  1. taxi_ids : list[string]

  2. timeCreated : timestamp

  3. Order_id : string

Application Flow is basically, once the order is placed, system checks if this can be taken up and then we find all the available taxis from which user can choose, which is shown to the user.

the taxi-available stream event is created after order event by next 2 minutes. that is

timeCreated(Order) < timeCreated(taxi-available) < timeCreated(Order) + 2 Minutes

I want to calculate the aggregate for a 10 minute window, where I like to average all the no. of taxi available for a given region_id

My SQL look like this :

SELECT o.region_id, avg(length(ta.taxis))
FROM order o JOIN taxi_available ta 
     ON o.order_id=ta.order_id 
        AND 
       ta.timeCreated BETWEEN o.timeCreated AND o.timeCreated + INTERVAL '2' Minute
GROUP BY TUMBLE(o.timeCreated, INTERVAL '10' Minute), o.region_id

I like to know what is wrong in this approach, Does time-window joins work for aggregates??

My doubt is that

it seems the Tumble(o.timeCreated, INTERVAL '10' Minute) is operating on both streams and making my query calculate average for same 10 minute data on both,

i.e. say order-stream from 2:00 till 2:10 then the above computation should take 2:00 till 2:12 from taxi_available stream, but instead it took from 2:00 till 2:10 for both streams

In which case the events coming in last 2 minutes of order streams (which will find the taxi-available events coming in next 2 minutes) may not find matching row and this average is not actually for 10 minutes orders rather less than 10 minutes then!!,

If that is true ?? what can be done for this ?

  • Can you explain why you are concerned? Have you tried this query on some test data? – David Anderson Mar 23 '20 at 19:50
  • @DavidAnderson yes, we are using similar query, I did more analysis, what is happening is it takes 10 mins data for order stream and wait till 2 more minutes of available-taxi stream data and then publish it, basically the computation is correct, just the publish time is 2 mins after each tumble window. – Nipun Jindal Mar 25 '20 at 15:36
  • It sounds then like everything is working perfectly. – David Anderson Mar 25 '20 at 15:44

0 Answers0