I was testing the exactly-once semantics of ksqlDB Server by ungracefully shutting down the running Docker process, or by letting the Docker container run out of memory. In both cases I receive duplicates, which is definitely not the guaranteed behaviour. I feel like I might be missing the obvious here ...
The Docker container has the KSQL_KSQL_STREAMS_PROCESSING_GUARANTEE=exactly_once
parameter set. As far as I understand, this sets the underlying producer's enable.idempotence
and the consumer's isolation.level properties.
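For reference, this is roughly what that guarantee translates to in Kafka Streams terms. This is a sketch for illustration only: ksqlDB applies these internally, they are not meant to be set by hand, and the transactional.id value is generated per task rather than configured:

```properties
# Effective Kafka Streams settings implied by processing.guarantee=exactly_once
processing.guarantee=exactly_once
# producer side: idempotent, transactional writes
producer.enable.idempotence=true
# consumer side: skip records from aborted transactions
consumer.isolation.level=read_committed
```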
And still, duplicates appear as a result of the following queries:
CREATE OR REPLACE TABLE TEST WITH (KAFKA_TOPIC='TEST', VALUE_FORMAT='AVRO', PARTITIONS=10, REPLICAS=1)
AS SELECT
    CUSTOMERS_ID,
    EARLIEST_BY_OFFSET(LDTS) AS LDTS,
    COLLECT_SET(NAMES) AS NAMES,
    EARLIEST_BY_OFFSET(CUSTOMER_PK) AS CUSTOMER_PK
FROM TEST_1
GROUP BY CUSTOMERS_ID
EMIT CHANGES;
and also here
CREATE OR REPLACE STREAM TEST_STREAM (CUSTOMERS_ID VARCHAR KEY, LDTS BIGINT, NAMES ARRAY<VARCHAR>, CUSTOMER_PK VARCHAR)
WITH (KAFKA_TOPIC='TEST', KEY_FORMAT='KAFKA', VALUE_FORMAT='AVRO');

CREATE OR REPLACE STREAM TEST_FINAL (KAFKA_KEY VARCHAR KEY, CUSTOMERS_ID VARCHAR, LDTS BIGINT, NAME VARCHAR, CUSTOMER_PK VARCHAR)
WITH (KAFKA_TOPIC='TEST_FINAL', VALUE_FORMAT='AVRO', PARTITIONS=10, REPLICAS=1);
INSERT INTO TEST_FINAL
SELECT
    CUSTOMERS_ID AS KAFKA_KEY,
    AS_VALUE(CUSTOMERS_ID) AS CUSTOMERS_ID,
    LDTS,
    NAMES[1] AS NAME,
    CUSTOMER_PK
FROM TEST_STREAM
WHERE ROWTIME = LDTS AND ARRAY_LENGTH(NAMES) = 1;
You can ignore the logic of the SQL; these are just examples to make the question meatier. The point is that offsets are obviously being lost during the container crash.
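To illustrate what I mean by duplicates: I consume the output topic and count repeated (key, value) pairs. A minimal sketch of that counting step, assuming the records have already been consumed and deserialized into tuples (the record shape here is hypothetical; the actual consumption would use a Kafka consumer with isolation.level=read_committed):

```python
from collections import Counter

def find_duplicates(records):
    """Return the (key, value) pairs that appear more than once.

    `records` is a list of (key, value) tuples as consumed from the
    output topic (hypothetical shape -- adapt to your deserializer).
    Since each input row should yield each output row at most once
    under exactly-once semantics, repeats indicate replayed writes.
    """
    counts = Counter(records)
    return {rec: n for rec, n in counts.items() if n > 1}

# Example: the second ('42', 'alice') row is a duplicate.
rows = [("42", "alice"), ("43", "bob"), ("42", "alice")]
print(find_duplicates(rows))  # {('42', 'alice'): 2}
```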
What else can I do? Are there any properties I am missing?
I am using a Kafka broker from Confluent Community v6.2.1 and ksqlDB v0.21.