
I want to run a simple query in Flink SQL on one table, including a GROUP BY statement. But the results contain duplicate rows for the column specified in the GROUP BY clause. Is that because I use a streaming environment and it doesn't remember the state?

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
// configure Kafka consumer
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // Broker default host:port
props.setProperty("group.id", "flink-consumer"); // Consumer group ID

FlinkKafkaConsumer011<BlocksTransactions> flinkBlocksTransactionsConsumer = new FlinkKafkaConsumer011<>(args[0], new BlocksTransactionsSchema(), props);
flinkBlocksTransactionsConsumer.setStartFromEarliest();

DataStream<BlocksTransactions> blocksTransactions = env.addSource(flinkBlocksTransactionsConsumer);


tableEnv.registerDataStream("blocksTransactionsTable", blocksTransactions);

Table sqlResult
        = tableEnv.sqlQuery(
                "SELECT block_hash, count(tx_hash) " +
                "FROM blocksTransactionsTable " +
                "GROUP BY block_hash");

DataStream<Test> resultStream = tableEnv
        .toRetractStream(sqlResult, Row.class)
        .map(t -> {
            Row r = t.f1;
            String field2 = r.getField(0).toString();
            long count = Long.valueOf(r.getField(1).toString());
            return new Test(field2, count);
        })
        .returns(Test.class);

resultStream.print();

resultStream.addSink(new FlinkKafkaProducer011<>("localhost:9092", "TargetTopic", new TestSchema()));

env.execute();

I use the GROUP BY statement on the block_hash column, but the same block_hash appears several times. This is the output of print():

Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=1}
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=1}
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=2}
Test{field2='0x780aadc08c294da46e174fa287172038bba7afacf2dff41fdf0f6def03906e60', count=1}
Test{field2='0x182d31bd491527e1e93c4e44686057207ee90c6a8428308a2bd7b6a4d2e10e53', count=1}
Test{field2='0x182d31bd491527e1e93c4e44686057207ee90c6a8428308a2bd7b6a4d2e10e53', count=1}

How can I fix this without using BatchEnvironment ?


1 Answer


A GROUP BY query that runs on a stream must produce updates. Consider the following example:

SELECT user, COUNT(*) FROM clicks GROUP BY user;

Every time the clicks table receives a new row, the count of the respective user needs to be incremented and updated.

When you convert a Table into a DataStream, these updates must be encoded in the stream. Flink uses retraction and add messages to do that. By calling tEnv.toRetractStream(table, Row.class), you convert the Table table into a DataStream<Tuple2<Boolean, Row>>. The Boolean flag is important: it indicates whether the Row is added to or retracted from the result table.
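
For illustration, here is a minimal sketch (reusing the tableEnv and sqlResult variables from the question) that prints each message together with its flag, so the add and retract messages become visible; the "+"/"-" prefixes are only for readability:

// assumes org.apache.flink.api.common.typeinfo.Types is imported
tableEnv.toRetractStream(sqlResult, Row.class)
        .map(t -> (t.f0 ? "+ " : "- ") + t.f1) // t.f0 == true: add, t.f0 == false: retract
        .returns(Types.STRING)
        .print();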

Given the example query above and the input table clicks as

user | ...
------------
Bob  | ...
Liz  | ...
Bob  | ...

You will receive the following retraction stream:

(+, (Bob, 1)) // add first result for Bob
(+, (Liz, 1)) // add first result for Liz
(-, (Bob, 1)) // remove outdated result for Bob
(+, (Bob, 2)) // add updated result for Bob

You need to actively maintain the result yourself and add and remove rows as instructed by the Boolean flag of the retraction stream.
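
As a minimal, framework-free sketch of that bookkeeping (the helper name and the plain Map representation are illustrative assumptions, not Flink API), each message of the retraction stream can be applied to a materialized result like this:

// Applies one message of a retraction stream to a materialized result map.
static void applyMessage(java.util.Map<String, Long> result, boolean isAdd, String key, long count) {
    if (isAdd) {
        result.put(key, count);    // add message: insert or overwrite the row for this key
    } else {
        result.remove(key, count); // retract message: drop the now-outdated row
    }
}

Applied to the four messages above, the map ends up holding (Bob, 2) and (Liz, 1), i.e. exactly the content of the result table.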

  • Thanks for the reply, it really helped me. I now understand the boolean field. But how can I get only the last result per row? If I filter out the rows with false values, I still get the intermediate updated values... I can't store the result somewhere and update it afterwards, that takes too much time... – Gatsby Aug 31 '18 at 12:51
  • I'm afraid you have to maintain the updates yourself. It's a characteristic of streaming queries that they need to update their results. If you have a bounded stream, you can append a `ProcessFunction` that keeps the last result per key and emits it when the job terminates. – Fabian Hueske Aug 31 '18 at 13:00
  • I have a bounded stream so a ProcessFunction should be the solution. Can you help me? Apart from this example: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html#example, I couldn't find other examples. – Gatsby Sep 02 '18 at 09:26
  • I can give you a few pointers: keyBy the result stream on the grouping key, and add keyed state that stores the latest version of the row per key. Also set, for each key, an event-time timer to `Long.MAX_VALUE - 1` that will be called when the stream terminates. In the `onTimer` method you emit all results. **But** I would actually recommend processing in batch mode. Streaming does not provide any benefit if you do not need intermediate results, and it is much more expensive. – Fabian Hueske Sep 02 '18 at 14:23
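
Following the pointers in the last comment, here is a sketch of such a function, assuming a Flink version that provides KeyedProcessFunction (1.5+) and assuming the Test class from the question exposes its grouping key as field2 (the class and field names come from the question, but the accessors are guesses). It keeps the latest result per key in keyed state and emits it when an event-time timer at Long.MAX_VALUE - 1 fires, which happens once the bounded stream terminates and Flink emits the final Long.MAX_VALUE watermark:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class LatestResultFunction extends KeyedProcessFunction<String, Test, Test> {

    private transient ValueState<Test> latest;

    @Override
    public void open(Configuration parameters) {
        latest = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest", Test.class));
    }

    @Override
    public void processElement(Test value, Context ctx, Collector<Test> out) throws Exception {
        // keep only the most recent result for this key
        latest.update(value);
        // registering the same timestamp again is a no-op, so this is safe per record
        ctx.timerService().registerEventTimeTimer(Long.MAX_VALUE - 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Test> out) throws Exception {
        Test result = latest.value();
        if (result != null) {
            out.collect(result); // emit the final count for this key
        }
    }
}

It would be wired in after the map from the question, e.g. resultStream.keyBy(t -> t.field2).process(new LatestResultFunction()), and requires the job to run with event-time semantics so that the final watermark is emitted when the source finishes.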