Why do I see duplicates in KSQL stream?

Question

I created a KSQL stream using CREATE STREAM AS SELECT (CSAS), and for some reason the stream's persistent CSAS query produces 4 duplicate records for each source record. How can I avoid the duplicates? What is wrong with my setup?

Here is my setup:

A stream from an underlying Kafka topic:

CREATE STREAM ORDERS ( ... ) WITH (
  KAFKA_TOPIC='orders.prod',
  VALUE_FORMAT='json'
);

This stream looks good; selecting by key returns one record:

SELECT * FROM ORDERS WHERE ROWKEY = 'order-123';

1553124285000 | order-123 | ... | ... | ...

Rekeyed stream:

CREATE STREAM ORDERS_REKEYED WITH (PARTITIONS=6, REPLICAS=2)
  AS SELECT * FROM ORDERS PARTITION BY LEGACY_ID;

Now, when querying the rekeyed stream, I see 4 identical records:

SELECT * FROM ORDERS_REKEYED WHERE ROWKEY = 'abc';

1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...

That's not what I'm expecting. I started looking at the running queries using SHOW QUERIES; and found that each node runs my query with a different numeric suffix, e.g. node 1 runs CSAS_ORDERS_REKEYED_16 and node 2 runs CSAS_ORDERS_REKEYED_21. Here is the full list of running queries by node:

node 1: CSAS_ORDERS_REKEYED_16
node 2: CSAS_ORDERS_REKEYED_21
node 3: CSAS_ORDERS_REKEYED_15
node 4: CSAS_ORDERS_REKEYED_21
node 5: CSAS_ORDERS_REKEYED_16
node 6: CSAS_ORDERS_REKEYED_18

I don't understand why I have 4 distinct queries (16, 21, 15, 18) across 6 nodes. Could this be the reason for getting 4 identical output records for each input record?

Should there be only one unique query across all nodes? Or should every node run the query with its own suffix number?

Well, across a cluster of nodes, one query would do the job; it is a cluster, after all. — srikanth, Mar 21 '19 at 05:58
@srikanth let me clarify: I created the stream on one node, and it was propagated to all 6 nodes automatically, since it's a cluster. — andrii, Mar 21 '19 at 14:42
Is there any news on this? I'm running into the same problem: https://stackoverflow.com/questions/57770983/data-is-duplicated-when-i-create-a-flattened-stream?noredirect=1#comment101977125_57770983 — mohamedaymen benmoussa, Sep 03 '19 at 14:57

score 0 · Answer 1 · answered Dec 11 '19 at 11:06

I've raised a bug on GitHub to track this, as you are not the only person to have reported it: https://github.com/confluentinc/ksql/issues/4111

Could you please let us know which version of the CLI / server you are using?

Would you also be able to grab the server logs from around the time this happened? That could be super useful. Please upload them to the GitHub issue.
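
For anyone gathering these details, the statements below are a minimal sketch of what could be run from the KSQL CLI on each node. The query ID and stream name are taken from the question. Which of these outputs matters most for the GitHub issue is an assumption, not something the issue itself specifies. The CLI and server versions are normally printed in the banner when the CLI connects.

```sql
-- List the persistent queries this server reports, with their IDs.
SHOW QUERIES;

-- Show the sources, sinks and execution plan for one query ID
-- (ID taken from the question; repeat for each distinct ID that appears).
EXPLAIN CSAS_ORDERS_REKEYED_16;

-- Show the stream's metadata, including the queries that write into it.
DESCRIBE EXTENDED ORDERS_REKEYED;
```

Comparing the output across nodes should show whether the four query IDs all write to the same sink topic, which is the kind of detail worth attaching alongside the server logs.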
