
My requirement is to process, or build some logic around, the results of SQL queries in Flink. For simplicity, let's say I have two SQL queries that run over the same event stream with different window sizes. My questions are:

  • a) How do I know which query a given result row came from?
  • b) How do I know how many rows a query produced? I need this because I have to build a notification message listing the events that are part of the query result.

    DataStream<Event> ds = ...
    String query = "select id, key" +
            "  from  eventTable  GROUP BY TUMBLE(rowTime, INTERVAL '10' SECOND), id, key ";

    String query1 = "select id, key" +
            "  from  eventTable  GROUP BY TUMBLE(rowTime, INTERVAL '1' DAY), id, key ";
    List<String> list = new ArrayList<>();
    list.add(query);
    list.add(query1);

    tabEnv.createTemporaryView("eventTable", ds, $("id"), $("timeLong"), $("key"), $("rowTime").rowtime());

    for (int i = 0; i < list.size(); i++) {
        Table result = tabEnv.sqlQuery(list.get(i));
        DataStream<Tuple2<Boolean, Row>> dsRow = tabEnv.toRetractStream(result, Row.class);
        dsRow.process(new ProcessFunction<Tuple2<Boolean, Row>, Object>() {

            List<Row> listRow = new ArrayList<>();

            @Override
            public void processElement(Tuple2<Boolean, Row> booleanRowTuple2, Context context, Collector<Object> collector) throws Exception {
                listRow.add(booleanRowTuple2.f1);
            }
        });
    }

I appreciate your help. Thanks, Ashutosh

TobiSH
Ashutosh

1 Answer

To sort out which results are from which query, you could include an identifier for each query in the queries themselves, e.g.,

SELECT '10sec', id, key FROM eventTable GROUP BY TUMBLE(rowTime, INTERVAL '10' SECOND), id, key
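
As a minimal sketch of that idea, the query strings can be generated with the tag baked in, so every emitted row names its query. The class `TaggedQueries`, the column aliases `queryId` and `windowEnd`, and the tag `'1day'` are hypothetical names chosen for illustration. Adding `TUMBLE_END` is an extra assumption beyond the suggestion above: it makes each row also carry the end of the window it belongs to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TaggedQueries {

    /** Builds a tumbling-window query whose rows carry a literal query tag and the window end. */
    static String taggedQuery(String queryId, String interval) {
        return "SELECT '" + queryId + "' AS queryId, "
                + "TUMBLE_END(rowTime, INTERVAL " + interval + ") AS windowEnd, id, key "
                + "FROM eventTable "
                + "GROUP BY TUMBLE(rowTime, INTERVAL " + interval + "), id, key";
    }

    /** The two queries from the question, keyed by their (hypothetical) tags. */
    static Map<String, String> build() {
        Map<String, String> queries = new LinkedHashMap<>();
        queries.put("10sec", taggedQuery("10sec", "'10' SECOND"));
        queries.put("1day", taggedQuery("1day", "'1' DAY"));
        return queries;
    }

    public static void main(String[] args) {
        // Each string could then be passed to tabEnv.sqlQuery(...) as in the question.
        build().forEach((id, sql) -> System.out.println(id + ": " + sql));
    }
}
```

With rows shaped like this, field 0 of each `Row` identifies the query, so a single downstream function can handle the results of both queries.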

Determining the number of rows in the result table is trickier. One issue is that there is no final answer to the number of results from a streaming query. But where you are processing the results, it seems like you could count the number of rows.
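
Counting at the processing step could look roughly like the following plain-Java sketch, which has no Flink dependency. It assumes each result row has already been reduced to an array of the form `{queryId, windowEnd, id, key}`, which means the query also selects a tag literal and something like `TUMBLE_END(...)`. The class `ResultGrouper` and that row layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ResultGrouper {

    /** Groups rows by "queryId@windowEnd"; each row is assumed to be {queryId, windowEnd, id, key}. */
    static Map<String, List<Object[]>> group(List<Object[]> rows) {
        Map<String, List<Object[]>> groups = new LinkedHashMap<>();
        for (Object[] row : rows) {
            String groupKey = row[0] + "@" + row[1];
            groups.computeIfAbsent(groupKey, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"10sec", 10_000L, 1, "a"});
        rows.add(new Object[] {"10sec", 10_000L, 2, "b"});
        rows.add(new Object[] {"10sec", 20_000L, 1, "a"});
        // The size of each group is the row count for one window of one query.
        group(rows).forEach((k, v) -> System.out.println(k + " -> " + v.size() + " rows"));
    }
}
```

The count is then per query and per window, which is the only granularity at which a streaming query has a definite number of rows.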

Or, and I haven't tried this, but maybe you could use something like row_number() over(order by tumble_rowtime(rowTime, interval '10' second)) to annotate each row of the result with a counter.

David Anderson
  • Thanks @David, I will try this solution. Since the result is a stream, even if I add an identifier to the query, it is still difficult to tell when a query result is complete so that I can operate on the whole set. Should I wait a few seconds and then assume that all rows for that identifier have arrived and are ready to process? Please suggest. – Ashutosh Aug 06 '20 at 12:28
  • If the input is bounded, you could use a batch rather than a streaming runtime, which would make it easier to know that the results are complete. – David Anderson Aug 06 '20 at 12:37
  • Adding the identifier '10sec', as suggested in the answer, will not solve the problem if query results arrive back to back. My input events are unbounded; that is the problem, and I am stuck on determining when a query result is complete :( – Ashutosh Sep 13 '20 at 10:35
  • @Ashutosh I'm not sure what you mean by "completeness". Watermarks may be used with event time timestamps to determine when each window is complete. But if the input stream is unbounded, then there is no end to the result stream. Each window will produce just one result, but there will be one window after another, forever, unless the input stops. – David Anderson Sep 14 '20 at 13:33
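
The watermark idea from the last comment can be illustrated with a small plain-Java simulation; it is a sketch, not Flink code. Rows are buffered per window end, and a window's rows are released once a watermark at or past that end is seen. The class `WindowBuffer` and its methods are hypothetical. In a real job this logic would more likely live in a `KeyedProcessFunction` using event-time timers, which is an assumption about the design rather than something tested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WindowBuffer {

    // Rows buffered per window-end timestamp (ms), sorted so the oldest windows come first.
    private final TreeMap<Long, List<String>> pending = new TreeMap<>();

    /** Buffers one result row under the end timestamp of its window. */
    void add(long windowEnd, String row) {
        pending.computeIfAbsent(windowEnd, k -> new ArrayList<>()).add(row);
    }

    /** Returns and removes the rows of every window whose end is at or before the watermark. */
    List<List<String>> onWatermark(long watermark) {
        Map<Long, List<String>> closed = pending.headMap(watermark, true);
        List<List<String>> complete = new ArrayList<>(closed.values());
        closed.clear(); // headMap is a view, so this also removes the entries from pending
        return complete;
    }

    public static void main(String[] args) {
        WindowBuffer buffer = new WindowBuffer();
        buffer.add(10_000L, "id=1,key=a");
        buffer.add(10_000L, "id=2,key=b");
        buffer.add(20_000L, "id=1,key=a");
        // Only the window ending at 10 000 ms is released; the later one stays buffered.
        for (List<String> window : buffer.onWatermark(10_000L)) {
            System.out.println("complete window with " + window.size() + " rows: " + window);
        }
    }
}
```

Each released list is one complete window result, so its size is the row count and its contents are the events for a notification message. No fixed waiting period is involved: completeness is decided by event time, not by wall-clock delay.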