I created a KSQL stream using CREATE STREAM AS SELECT, and for some reason the stream's CSAS persistent query produces 4 duplicate records for each source record. How can I avoid the duplicates? What is wrong with my setup?
Here is my setup:
- A stream from an underlying Kafka topic:
CREATE STREAM ORDERS ( ... ) WITH (
KAFKA_TOPIC='orders.prod',
VALUE_FORMAT='json'
);
This stream looks good. Selecting by key returns one record:
SELECT * FROM ORDERS WHERE ROWKEY = 'order-123';
1553124285000 | order-123 | ... | ... | ...
- A rekeyed stream:
CREATE STREAM ORDERS_REKEYED WITH (PARTITIONS=6, REPLICAS=2)
AS SELECT * FROM ORDERS PARTITION BY LEGACY_ID;
Now, when querying the rekeyed stream, I see 4 identical records:
SELECT * FROM ORDERS_REKEYED WHERE ROWKEY = 'abc';
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
1553124285000 | abc | order-123 | ... | ... | ...
That's not what I'm expecting. I started looking at the running queries using SHOW QUERIES; and found that the nodes run my query with different postfix numbers, e.g. node 1 runs CSAS_ORDERS_REKEYED_16, node 2 runs CSAS_ORDERS_REKEYED_21, and so on. Here is the full list of running queries by node:
- node 1: CSAS_ORDERS_REKEYED_16
- node 2: CSAS_ORDERS_REKEYED_21
- node 3: CSAS_ORDERS_REKEYED_15
- node 4: CSAS_ORDERS_REKEYED_21
- node 5: CSAS_ORDERS_REKEYED_16
- node 6: CSAS_ORDERS_REKEYED_18
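To double-check the count, here is a minimal Python sketch that tallies the query IDs from the list above. The name queries_by_node is only illustrative, and the values are copied by hand from the list rather than fetched from KSQL:

```python
from collections import Counter

# Query IDs reported by SHOW QUERIES on each node (copied by hand from the list above).
queries_by_node = {
    "node 1": "CSAS_ORDERS_REKEYED_16",
    "node 2": "CSAS_ORDERS_REKEYED_21",
    "node 3": "CSAS_ORDERS_REKEYED_15",
    "node 4": "CSAS_ORDERS_REKEYED_21",
    "node 5": "CSAS_ORDERS_REKEYED_16",
    "node 6": "CSAS_ORDERS_REKEYED_18",
}

# Count how many nodes report each query ID.
counts = Counter(queries_by_node.values())
distinct_ids = sorted(counts)

print(f"{len(distinct_ids)} distinct query IDs across {len(queries_by_node)} nodes")
for query_id in distinct_ids:
    print(f"{query_id}: {counts[query_id]} node(s)")
```

The number of distinct IDs matches the number of duplicate records I see per input record, which is why I suspect the two are related.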
I don't understand why I have 4 distinct queries (16, 21, 15, 18) across 6 nodes. Could this be the reason for the 4 identical output records per input record?
Should there be only one unique query across all nodes, or should every node run the query with its own postfix number?