
I created a KSQL stream using CREATE STREAM AS SELECT (CSAS), and for some reason the stream's persistent CSAS query produces 4 duplicate records for each source record. How can I avoid the duplicates? What is wrong with my setup?

Here is my setup:

  1. A stream from an underlying Kafka topic:
CREATE STREAM ORDERS ( ... ) WITH (
  KAFKA_TOPIC='orders.prod',
  VALUE_FORMAT='json'
);

This stream looks good. Selecting by key returns one record:

SELECT * FROM ORDERS WHERE ROWKEY = 'order-123';

1553124285000 | order-123 | ... | ... | ...

  2. Rekeyed stream:
CREATE STREAM ORDERS_REKEYED WITH (PARTITIONS=6, REPLICAS=2)
  AS SELECT * FROM ORDERS PARTITION BY LEGACY_ID;

Now, when querying the rekeyed stream, I see 4 identical records:

SELECT * FROM ORDERS_REKEYED WHERE ROWKEY = 'abc';

1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...

That's not what I'm expecting. I looked at the running queries using show queries; and found that each node runs my query under a different numeric suffix, e.g. node 1 runs CSAS_ORDERS_REKEYED_16 and node 2 runs CSAS_ORDERS_REKEYED_21. Here is the full list of running queries by node:

  • node 1: CSAS_ORDERS_REKEYED_16
  • node 2: CSAS_ORDERS_REKEYED_21
  • node 3: CSAS_ORDERS_REKEYED_15
  • node 4: CSAS_ORDERS_REKEYED_21
  • node 5: CSAS_ORDERS_REKEYED_16
  • node 6: CSAS_ORDERS_REKEYED_18

I don't understand why I have 4 distinct queries (16, 21, 15, 18) across 6 nodes. Could this be the reason for getting 4 identical output records for each input record?

Should there be only one unique query across all nodes? Or should every node run the query with its own suffix?
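
For anyone trying to reproduce this, here is a minimal sketch of the statements that can be run from the KSQL CLI on each node to see which persistent queries write into the rekeyed stream. The query ID is simply the one from the node 1 entry above, and the output is omitted because it depends on the cluster and KSQL version:

```sql
-- List the persistent queries known to the server this CLI is connected to.
SHOW QUERIES;

-- Show the stream's details, including the queries that write into it.
DESCRIBE EXTENDED ORDERS_REKEYED;

-- Inspect one specific query (ID taken from the node 1 entry above).
EXPLAIN CSAS_ORDERS_REKEYED_16;
```

If more than one CSAS query turns out to write into ORDERS_REKEYED, that would be consistent with each source record being emitted several times, though this is an assumption rather than a confirmed diagnosis.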

Matthias J. Sax
andrii
  • Well, across the cluster of nodes, 1 query would do the job. You know, it's in a cluster. – srikanth Mar 21 '19 at 05:58
  • @srikanth let me clarify: I created the stream on one node, and it was propagated to all 6 nodes automatically, since it's a cluster. – andrii Mar 21 '19 at 14:42
  • Is there any news on this? I'm running into the same problem: https://stackoverflow.com/questions/57770983/data-is-duplicated-when-i-create-a-flattened-stream?noredirect=1#comment101977125_57770983 – mohamedaymen benmoussa Sep 03 '19 at 14:57

1 Answer

I've raised a bug on GitHub to track this, as you are not the only person to have reported it: https://github.com/confluentinc/ksql/issues/4111

Could you please let us know what version of the CLI / server you are using?

Would you also be able to grab the server logs from around the time this happened? That could be super useful. Please upload them to the GitHub issue.

Andrew Coates