
I want to run a simple query in Flink SQL on one table, including a GROUP BY statement. But the results contain duplicate rows for the column specified in the GROUP BY clause. Is that because I use a streaming environment and it doesn't remember the state?

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
// configure Kafka consumer
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // Broker default host:port
props.setProperty("group.id", "flink-consumer"); // Consumer group ID

FlinkKafkaConsumer011<BlocksTransactions> flinkBlocksTransactionsConsumer = new FlinkKafkaConsumer011<>(args[0], new BlocksTransactionsSchema(), props);
flinkBlocksTransactionsConsumer.setStartFromEarliest();

DataStream<BlocksTransactions> blocksTransactions = env.addSource(flinkBlocksTransactionsConsumer);


tableEnv.registerDataStream("blocksTransactionsTable", blocksTransactions);

Table sqlResult
        = tableEnv.sqlQuery(
                "SELECT block_hash, count(tx_hash) " +
                "FROM blocksTransactionsTable " +
                "GROUP BY block_hash");

DataStream<Test> resultStream = tableEnv
        .toRetractStream(sqlResult, Row.class)
        .map(t -> {
            Row r = t.f1;
            String field2 = r.getField(0).toString();
            long count = Long.valueOf(r.getField(1).toString());
            return new Test(field2, count);
        })
        .returns(Test.class);

resultStream.print();

resultStream.addSink(new FlinkKafkaProducer011<>("localhost:9092", "TargetTopic", new TestSchema()));

env.execute();

I use the GROUP BY statement on the block_hash column, but the same block_hash appears several times. This is the output of print():

Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=1}
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=1}
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=2}
Test{field2='0x780aadc08c294da46e174fa287172038bba7afacf2dff41fdf0f6def03906e60', count=1}
Test{field2='0x182d31bd491527e1e93c4e44686057207ee90c6a8428308a2bd7b6a4d2e10e53', count=1}
Test{field2='0x182d31bd491527e1e93c4e44686057207ee90c6a8428308a2bd7b6a4d2e10e53', count=1}

How can I fix this without using BatchEnvironment ?


1 Answer


A GROUP BY query that runs on a stream must produce updates. Consider the following example:

SELECT user, COUNT(*) FROM clicks GROUP BY user;

Every time the clicks table receives a new row, the count of the respective user needs to be incremented and updated.

When you convert a Table into a DataStream, these updates must be encoded in the stream. Flink uses retraction and add messages to do that. By calling tEnv.toRetractStream(table, Row.class), you convert the Table table into a DataStream<Tuple2<Boolean, Row>>. The Boolean flag is important: it indicates whether the Row is added to or retracted from the result table.
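
For illustration, here is a minimal sketch (reusing the tableEnv and sqlResult variables from the question) that prints each message together with its flag, so the add and retract messages become visible; the "+"/"-" prefixes are only for readability:

// assumes org.apache.flink.api.common.typeinfo.Types is imported
tableEnv.toRetractStream(sqlResult, Row.class)
        .map(t -> (t.f0 ? "+ " : "- ") + t.f1) // t.f0 == true: add, t.f0 == false: retract
        .returns(Types.STRING)
        .print();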

Given the example query above and the input table clicks as

user | ...
------------
Bob  | ...
Liz  | ...
Bob  | ...

You will receive the following retraction stream:

(+, (Bob, 1)) // add first result for Bob
(+, (Liz, 1)) // add first result for Liz
(-, (Bob, 1)) // remove outdated result for Bob
(+, (Bob, 2)) // add updated result for Bob

You need to actively maintain the result yourself and add and remove rows as instructed by the Boolean flag of the retraction stream.
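
As a minimal, framework-free sketch of that bookkeeping (the helper name and the plain Map representation are illustrative assumptions, not Flink API), each message of the retraction stream can be applied to a materialized result like this:

// Applies one message of a retraction stream to a materialized result map.
static void applyMessage(java.util.Map<String, Long> result, boolean isAdd, String key, long count) {
    if (isAdd) {
        result.put(key, count);    // add message: insert or overwrite the row for this key
    } else {
        result.remove(key, count); // retract message: drop the now-outdated row
    }
}

Applied to the four messages above, the map ends up holding (Bob, 2) and (Liz, 1), i.e. exactly the content of the result table.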

  • Thanks for the reply, it really helped me. I now understand the boolean field. But how can I get only the last result per row? If I filter out the rows with false values, I still get the intermediate updated values... I can't store the result somewhere and update it afterwards, that takes too much time... – Gatsby Aug 31 '18 at 12:51
  • I'm afraid you have to maintain the updates yourself. It's a characteristic of streaming queries that they need to update their results. If you have a bounded stream, you can append a `ProcessFunction` that keeps the last result per key and emits it when the job terminates. – Fabian Hueske Aug 31 '18 at 13:00
  • I have a bounded stream so a ProcessFunction should be the solution. Can you help me? Apart from this example: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html#example, I couldn't find other examples. – Gatsby Sep 02 '18 at 09:26
  • I can give you a few pointers: keyBy the result stream on the grouping key, and add keyed state that stores the latest version of the row per key. Also set, for each key, an event-time timer to `Long.MAX_VALUE - 1` that will be called when the stream terminates. In the `onTimer` method you emit all results. **But** I would actually recommend processing in batch mode. Streaming does not provide any benefit if you do not need intermediate results, and it is much more expensive. – Fabian Hueske Sep 02 '18 at 14:23
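
Following the pointers in the last comment, here is a sketch of such a function, assuming a Flink version that provides KeyedProcessFunction (1.5+) and assuming the Test class from the question exposes its grouping key as field2 (the class and field names come from the question, but the accessors are guesses). It keeps the latest result per key in keyed state and emits it when an event-time timer at Long.MAX_VALUE - 1 fires, which happens once the bounded stream terminates and Flink emits the final Long.MAX_VALUE watermark:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class LatestResultFunction extends KeyedProcessFunction<String, Test, Test> {

    private transient ValueState<Test> latest;

    @Override
    public void open(Configuration parameters) {
        latest = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest", Test.class));
    }

    @Override
    public void processElement(Test value, Context ctx, Collector<Test> out) throws Exception {
        // keep only the most recent result for this key
        latest.update(value);
        // registering the same timestamp again is a no-op, so this is safe per record
        ctx.timerService().registerEventTimeTimer(Long.MAX_VALUE - 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Test> out) throws Exception {
        Test result = latest.value();
        if (result != null) {
            out.collect(result); // emit the final count for this key
        }
    }
}

It would be wired in after the map from the question, e.g. resultStream.keyBy(t -> t.field2).process(new LatestResultFunction()), and requires the job to run with event-time semantics so that the final watermark is emitted when the source finishes.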