
Use Case

The goal is to examine each incoming event/row and determine whether it is a new row or an update to an existing one. New rows should go to one topic and updated rows to a different topic.

Approach: Keep a lookup table (KTABLE) and perform two join operations against it: 1. an inner join to detect an update; 2. a left join, keeping only rows where the right-hand (table) key is null, to detect an insert/new row. Create two streams from the results of these two operations, then run an INSERT INTO query targeting a stream that writes records back into the lookup table.
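
A minimal sketch of that topology in KSQL, using simplified hypothetical names (events_stream, lookup_table); the full statements with the real column lists appear in the steps below.

-- Updates: event rows whose key already exists in the lookup table.
CREATE STREAM update_stream AS SELECT event.* FROM events_stream event JOIN lookup_table lookup ON event.sender_code = lookup.sender_code;

-- Inserts: event rows with no matching key in the lookup table.
CREATE STREAM insert_stream AS SELECT event.* FROM events_stream event LEFT JOIN lookup_table lookup ON event.sender_code = lookup.sender_code WHERE lookup.sender_code IS NULL;

-- Both result streams are then written back to the lookup table's topic via INSERT INTO, so the table reflects every row that has already been processed.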

Steps to replicate the issue are below (takes about 7 minutes):

Step 1. Bring up the latest Confluent 5.1.0 platform with Docker using docker-compose up.
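
A rough sketch of what that looks like, assuming the cp-all-in-one example from Confluent's cp-docker-images repository (adjust the path and branch to wherever your 5.1.0 docker-compose.yml lives):

git clone https://github.com/confluentinc/cp-docker-images.git
cd cp-docker-images/examples/cp-all-in-one
docker-compose up -d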

Step 2. Run docker ps.

Note: Make sure the broker is up. The broker was frequently going down in my local setup.
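
If it is not, a quick way to check and restart just the broker container (assuming the compose service is named broker, as it is in cp-all-in-one):

docker ps --filter "name=broker"   # should show the broker container as Up
docker-compose up -d broker        # bring it back if it has exited
docker logs --tail 50 broker       # check the most recent startup output for errors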

Step 3.

In a new terminal, open a bash shell in the Schema Registry container. (It's easier to monitor if you keep this terminal open.)

docker run -it --net=cp-all-in-one_default --rm confluentinc/cp-schema-registry:5.1.0 bash

Step 4.

Create the lookup table's underlying topic, LOAD.TEST.LOCAL.LOOKUP.TABLE, and seed it. My schema has a key of type string. Three sample records are below; you are populating the lookup table first with 3 initial dummy records.

kafka-avro-console-producer --broker-list broker:9092 --topic LOAD.TEST.LOCAL.LOOKUP.TABLE \
    --property schema.registry.url=http://schema-registry:8081 \
    --property parse.key=true \
    --property key.separator=, \
    --property key.schema='{"type":"string"}' \
    --property value.schema='{"name":"LOAD.TEST.LOCAL.LOOKUP.TABLE","type":"record","namespace":"example.sender.batch","fields":[{"name":"SENDER_CODE","type":"string"},{"name":"SENDER_NAME","type":"string"},{"name":"SENDER_CATEGORY_CODE","type":"string"},{"name":"SENDER_AGENCY_CODE","type":"string"},{"name":"SENDER_SUB_AGENCY_CODE","type":"string"},{"name":"SENDER_FOREIGN_IND","type":"string"},{"name":"SENDER_FOREIGN_COUNTRY","type":"string"},{"name":"SENDER_NAME_ALTERNATE","type":"string"},{"name":"PARENT_SENDER_CODE","type":"string"},{"name":"CHANGE_DATE","type":"string"},{"name":"REQUESTING_LOCATION","type":"string"},{"name":"REQUEST_DATE","type":"string"},{"name":"REPLACEMENT_SENDER_CODE","type":"string"},{"name":"SENDER_STATUS","type":"string"},{"name":"SENDER_DUNS","type":"string"},{"name":"ADDRESSLINE1","type":"string"},{"name":"ADDRESSLINE2","type":"string"},{"name":"ADDRESSLINE3","type":"string"},{"name":"ADDRESS4","type":"string"},{"name":"CITY","type":"string"},{"name":"STATE","type":"string"},{"name":"POSTAL_CODE","type":"string"},{"name":"URL","type":"string"},{"name":"SENDER_ACRONYM","type":"string"},{"name":"DEACTIVATED_DATE","type":"string"},{"name":"Kafka_TimeEvent","type":"string"}]}'

Now paste the 3 records below. If you press return multiple times and get an exception, just run the same producer command again and paste the records after pressing return once.

"SVI6FQ",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FR",{"SENDER_CODE":"SVI6FR","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"374 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 973","ADDRESSLINE3":"MAILBOXC","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FN",{"SENDER_CODE":"SVI6FN","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"372 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXA","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"} 

Press Ctrl+C to exit the producer.
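
Optionally, after exiting the producer, you can confirm the three records landed on the topic with the Avro console consumer (run from the same Schema Registry shell):

kafka-avro-console-consumer --bootstrap-server broker:9092 --topic LOAD.TEST.LOCAL.LOOKUP.TABLE \
    --from-beginning \
    --property schema.registry.url=http://schema-registry:8081 \
    --property print.key=true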

Step 5. In a different terminal, open a KSQL CLI:

docker run --network cp-all-in-one_default --interactive --tty --rm confluentinc/cp-ksql-cli:latest http://ksql-server:8088

Step 6. Create a KTABLE.

create table load_test_local_lookup_table with (KAFKA_TOPIC='LOAD.TEST.LOCAL.LOOKUP.TABLE',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');

Step 7. Make sure to set the property below so that you can see results from the beginning offset. Run this in KSQL.

ksql> SET 'auto.offset.reset'='earliest';

You will see the following message: Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'
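
As an optional sanity check, the table should now return the three seed records from Step 4. Note that the SELECT is a continuous query, so stop it with Ctrl+C once you have seen the rows:

DESCRIBE load_test_local_lookup_table;
SELECT sender_code, sender_name, city FROM load_test_local_lookup_table;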

Step 8. Now create the topic that your events will stream into. Use the Schema Registry bash shell opened in Step 3, and populate the same records into this underlying topic.

 kafka-avro-console-producer --broker-list broker:9092 --topic LOAD.TEST.LOCAL.EVENT.STREAM \
    --property schema.registry.url=http://schema-registry:8081 \
    --property parse.key=true \
    --property key.separator=, \
    --property key.schema='{"type":"string"}' \
    --property value.schema='{"name":"LOAD.TEST.LOCAL.EVENT.STREAM","type":"record","namespace":"example.sender.batch","fields":[{"name":"SENDER_CODE","type":"string"},{"name":"SENDER_NAME","type":"string"},{"name":"SENDER_CATEGORY_CODE","type":"string"},{"name":"SENDER_AGENCY_CODE","type":"string"},{"name":"SENDER_SUB_AGENCY_CODE","type":"string"},{"name":"SENDER_FOREIGN_IND","type":"string"},{"name":"SENDER_FOREIGN_COUNTRY","type":"string"},{"name":"SENDER_NAME_ALTERNATE","type":"string"},{"name":"PARENT_SENDER_CODE","type":"string"},{"name":"CHANGE_DATE","type":"string"},{"name":"REQUESTING_LOCATION","type":"string"},{"name":"REQUEST_DATE","type":"string"},{"name":"REPLACEMENT_SENDER_CODE","type":"string"},{"name":"SENDER_STATUS","type":"string"},{"name":"SENDER_DUNS","type":"string"},{"name":"ADDRESSLINE1","type":"string"},{"name":"ADDRESSLINE2","type":"string"},{"name":"ADDRESSLINE3","type":"string"},{"name":"ADDRESS4","type":"string"},{"name":"CITY","type":"string"},{"name":"STATE","type":"string"},{"name":"POSTAL_CODE","type":"string"},{"name":"URL","type":"string"},{"name":"SENDER_ACRONYM","type":"string"},{"name":"DEACTIVATED_DATE","type":"string"},{"name":"Kafka_TimeEvent","type":"string"}]}'

"SVI6FQ",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FR",{"SENDER_CODE":"SVI6FR","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"374 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 973","ADDRESSLINE3":"MAILBOXC","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FN",{"SENDER_CODE":"SVI6FN","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"372 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXA","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}

Step 9.

Create a stream over this event topic.

create stream load_test_local_event_stream with (KAFKA_TOPIC='LOAD.TEST.LOCAL.EVENT.STREAM',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');
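
Optionally, verify the stream also sees the three records (this assumes 'auto.offset.reset' is still set to 'earliest' from Step 7):

SELECT rowkey, sender_code, sender_name FROM load_test_local_event_stream LIMIT 3;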

Step 10.

Derive a stream from a stream-table join, which will be used to detect rows that already exist. We call this the update_stream. The topic created for this stream will contain only updates. This is one of my use cases: I have to filter out the messages that are updates.

create stream load_test_update_stream as select event.*  FROM load_test_local_event_stream event JOIN  load_test_local_lookup_table lookup ON event.sender_code = lookup.sender_Code;

Step 11.

Create a stream on top of the topic that backs the lookup table, so that when you want to update the lookup table you can insert into this stream. (If I am not wrong, you cannot insert directly into a KTABLE from a stream, hence this workaround.)

create stream load_test_lookup_feed_stream with (KAFKA_TOPIC='LOAD.TEST.LOCAL.LOOKUP.TABLE',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');

Step 12. Run an INSERT INTO query. Whenever a message is available in update_stream, this query writes it into the lookup-table feed stream, which in turn updates the lookup table.

Insert into load_test_lookup_feed_stream  select EVENT_SENDER_CODE AS SENDER_CODE, EVENT_SENDER_NAME AS SENDER_NAME, EVENT_SENDER_CATEGORY_CODE AS SENDER_CATEGORY_CODE , EVENT_SENDER_AGENCY_CODE AS SENDER_AGENCY_CODE , EVENT_SENDER_SUB_AGENCY_CODE AS SENDER_SUB_AGENCY_CODE, EVENT_SENDER_FOREIGN_IND AS SENDER_FOREIGN_IND, EVENT_SENDER_FOREIGN_COUNTRY AS SENDER_FOREIGN_COUNTRY  , EVENT_SENDER_NAME_ALTERNATE AS SENDER_NAME_ALTERNATE, EVENT_PARENT_SENDER_CODE AS PARENT_SENDER_CODE ,EVENT_CHANGE_DATE AS CHANGE_DATE, EVENT_REQUESTING_LOCATION AS REQUESTING_LOCATION , EVENT_REQUEST_DATE AS REQUEST_DATE, EVENT_REPLACEMENT_SENDER_CODE AS REPLACEMENT_SENDER_CODE  , EVENT_SENDER_STATUS AS SENDER_STATUS, EVENT_SENDER_DUNS AS SENDER_DUNS , EVENT_ADDRESSLINE1 AS ADDRESSLINE1 , EVENT_ADDRESSLINE2 AS ADDRESSLINE2, EVENT_ADDRESSLINE3 AS ADDRESSLINE3 , EVENT_ADDRESS4 AS ADDRESS4  , EVENT_CITY AS CITY , EVENT_STATE AS STATE, EVENT_POSTAL_CODE AS POSTAL_CODE, EVENT_URL AS URL, EVENT_SENDER_ACRONYM AS SENDER_ACRONYM , EVENT_DEACTIVATED_DATE AS DEACTIVATED_DATE, EVENT_KAFKA_TIMEEVENT AS KAFKA_TIMEEVENT from load_test_update_stream partition by SENDER_CODE ;
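
Optionally, you can confirm the INSERT INTO query is running and watch rows being written back to the lookup table's topic:

SHOW QUERIES;
PRINT 'LOAD.TEST.LOCAL.LOOKUP.TABLE' FROM BEGINNING;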

:exclamation: Issue: Although this updates my lookup table, the row is applied as a new record, not as an update. To replicate this issue, follow Step 15A.

Step 13. Very similar to the update case, create a stream that detects new records in the event stream.

create stream load_test_insert_stream as select event.*  FROM load_test_local_event_stream event left JOIN  load_test_local_lookup_table lookup ON event.sender_code = lookup.sender_Code where lookup.sender_Code is null ;

Verification: You can optionally run just the SELECT query (shown below) to understand what is happening. If your Schema Registry bash shell is open, insert a new record with your own key (try inserting a new record as in Step 15A). The new message will show up in this stream.
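
For reference, the bare SELECT (the same query as above without the CREATE STREAM wrapper) looks like this; only rows whose key is missing from the lookup table should appear:

select event.* FROM load_test_local_event_stream event left JOIN load_test_local_lookup_table lookup ON event.sender_code = lookup.sender_Code where lookup.sender_Code is null;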

Step 14. Create an INSERT INTO query like before. This inserts back into the lookup table, so your lookup table is now populated with the new message.

Insert into load_test_lookup_feed_stream select EVENT_SENDER_CODE AS SENDER_CODE, EVENT_SENDER_NAME AS SENDER_NAME, EVENT_SENDER_CATEGORY_CODE AS SENDER_CATEGORY_CODE , EVENT_SENDER_AGENCY_CODE AS SENDER_AGENCY_CODE , EVENT_SENDER_SUB_AGENCY_CODE AS SENDER_SUB_AGENCY_CODE, EVENT_SENDER_FOREIGN_IND AS SENDER_FOREIGN_IND, EVENT_SENDER_FOREIGN_COUNTRY AS SENDER_FOREIGN_COUNTRY  , EVENT_SENDER_NAME_ALTERNATE AS SENDER_NAME_ALTERNATE, EVENT_PARENT_SENDER_CODE AS PARENT_SENDER_CODE ,EVENT_CHANGE_DATE AS CHANGE_DATE, EVENT_REQUESTING_LOCATION AS REQUESTING_LOCATION , EVENT_REQUEST_DATE AS REQUEST_DATE, EVENT_REPLACEMENT_SENDER_CODE AS REPLACEMENT_SENDER_CODE  , EVENT_SENDER_STATUS AS SENDER_STATUS, EVENT_SENDER_DUNS AS SENDER_DUNS , EVENT_ADDRESSLINE1 AS ADDRESSLINE1 , EVENT_ADDRESSLINE2 AS ADDRESSLINE2, EVENT_ADDRESSLINE3 AS ADDRESSLINE3 , EVENT_ADDRESS4 AS ADDRESS4  , EVENT_CITY AS CITY , EVENT_STATE AS STATE, EVENT_POSTAL_CODE AS POSTAL_CODE, EVENT_URL AS URL, EVENT_SENDER_ACRONYM AS SENDER_ACRONYM , EVENT_DEACTIVATED_DATE AS DEACTIVATED_DATE, EVENT_KAFKA_TIMEEVENT AS KAFKA_TIMEEVENT from load_test_insert_stream partition by SENDER_CODE ;

Step 15.

What the issue is, and how to replicate it.

Step 15A. How to insert a new sample record

Run the command from Step 8 (with the schema). Insert/paste a new record like the one below. Notice that I have changed both the message key and the sender code; your message key and the row key should always match, for example "SVI6FW" and "SENDER_CODE":"SVI6FW".

"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}

Step 15B. How to update a sample record

Very similar to the new-record insertion in 15A, but use the same message key and just change the name or some other value. For example, 'SAM II' becomes 'SAM III':

"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM III","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}

Issue: As you can see, my lookup table is not getting updated. It treats each message as a new message even when it is sent with the same key. Because of this, I am not able to detect updates; every message is categorised as a new message.

You can test this by doing the following actions; the two queries shown after this list let you watch both result streams while you do so.

  1. Send a new message with your own key (as in 15A). It will show up in load_test_insert_stream.
  2. Send an updated message with the same key (as in 15B). It should show up in load_test_update_stream, but instead it goes to load_test_insert_stream, and the lookup table treats it as a new message.
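
To watch both result streams while you test, you can run these two queries in separate KSQL CLI sessions (note the EVENT_ column prefix that the joins added; stop each query with Ctrl+C):

SELECT rowkey, event_sender_code, event_sender_name FROM load_test_insert_stream;
SELECT rowkey, event_sender_code, event_sender_name FROM load_test_update_stream;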

Any new approach / suggestions are welcome!

    Just a suggestion, but I feel like something like this is better suited for a KSQL Github issue... In any case, I don't believe KSQL yet supports Avro keys in topics (even though you've defined it as string, it's still Avro) – OneCricketeer Dec 22 '18 at 06:39

1 Answer


I'm assuming your design is basically something close to:

-- stream of inputs:
CREATE STREAM INPUT (ID INT KEY, V0 INT) WITH (kafka_topic='test_topic', value_format='JSON', PARTITIONS=1);

-- table built from the stream of inserts:
CREATE TABLE EXISTING (ID INT PRIMARY KEY, IGNORED INT) WITH (kafka_topic='INSERTS', value_format='JSON', PARTITIONS=1);

-- stream of inserts:
CREATE STREAM INSERTS AS SELECT INPUT.ID, INPUT.V0 AS V0 FROM INPUT LEFT JOIN EXISTING ON INPUT.ID = EXISTING.ID WHERE EXISTING.ID IS NULL;

-- stream of updates:
CREATE STREAM UPDATES AS SELECT INPUT.ID, INPUT.V0 AS V0 FROM INPUT JOIN EXISTING ON INPUT.ID = EXISTING.ID;

And then you're inserting some records:

INSERT INTO INPUT VALUES (1, 3);
INSERT INTO INPUT VALUES (2, 4);
INSERT INTO INPUT VALUES (1, 5);

And you expect the first two rows to be output to the INSERTS stream, and the last row to the UPDATES stream.
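
One way to watch the two output streams while testing (a sketch in ksqlDB 0.11 syntax, where push queries need EMIT CHANGES):

SET 'auto.offset.reset'='earliest';
SELECT * FROM INSERTS EMIT CHANGES;
SELECT * FROM UPDATES EMIT CHANGES;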

I've tested the above on version 0.11 of ksqlDB and it does work.... ish.

If you insert each record one by one from the CLI then the output is as you would expect. However, if you insert all three rows together, e.g. by running them all on the same line in the CLI:

INSERT INTO INPUT VALUES (1, 3);INSERT INTO INPUT VALUES (2, 4);INSERT INTO INPUT VALUES (1, 5);

Then all three rows end up in the INSERTS stream. Why? you may ask.

tl;dr: the solution is brittle. It won't work if updates come close to inserts.

There is a race condition in the design. If the join processes the second input row before the first row's output has been produced to the INSERTS topic and the join has polled to read that row, then the EXISTING table will not yet include it, and so the second row will incorrectly be sent to INSERTS and not UPDATES.

There are some configs you can play around with to see if you can get this working for your use-case.

  • Setting max.task.idle.ms higher will mean the join will wait longer for data to appear on the table side of the join. However, this won't help if updates and inserts happen in the same millisecond, and increasing it will hurt throughput and latency.
  • Setting cache.max.bytes.buffering to zero will turn off buffering in the Streams library, which may help.
  • Setting linger.ms to zero will mean the Kafka producer doesn't delay sending messages.

Even with all of this, the system is asynchronous and your results may vary. If updates never happen close to inserts, then the system would work. However, if updates can come close to inserts, you may find these are incorrectly categorised as inserts.
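
As a rough sketch of where these settings live: they are Kafka Streams and producer configs, so one option is to set them in the KSQL server properties file using the ksql.streams. prefix, which is stripped and passed through to the underlying Kafka Streams application (the values below are illustrative, not recommendations):

ksql.streams.max.task.idle.ms=2000
ksql.streams.cache.max.bytes.buffering=0
ksql.streams.producer.linger.ms=0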

Andrew Coates