Use Case
The goal is to inspect each incoming event/row and determine whether it is a new row or an update to an existing row. New rows should go to one topic and updated rows to another.
Approach: maintain a lookup table (KTABLE) and perform two join operations: 1. an inner join to detect updates; 2. a left join filtered on a null right-side key to detect inserts/new rows. Create two streams from the results of these joins, then run an INSERT INTO query against a stream that feeds records back into the lookup table.
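In KSQL terms the topology looks roughly like this (a condensed preview; the exact statements are built step by step below, and the names match the ones created there):
-- Update detection: inner join, the key already exists in the lookup table
create stream load_test_update_stream as select event.* from load_test_local_event_stream event join load_test_local_lookup_table lookup on event.sender_code = lookup.sender_code;
-- Insert detection: left join filtered on a null right-side key
create stream load_test_insert_stream as select event.* from load_test_local_event_stream event left join load_test_local_lookup_table lookup on event.sender_code = lookup.sender_code where lookup.sender_code is null;
-- Feedback: INSERT INTO a stream that shares the lookup table's topic, so matched rows refresh the table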
The steps to replicate the issue are below (takes about 7 minutes).
Step 1. Start the latest Confluent 5.1.0 platform with Docker:
docker-compose up
Step 2. docker ps
Note: Make sure the broker is up. The broker was frequently going down in my local environment.
Step 3. Start a Schema Registry container with a bash shell in a new terminal. (It's easier to monitor if you keep this terminal open.)
docker run -it --net=cp-all-in-one_default --rm confluentinc/cp-schema-registry:5.1.0 bash
Step 4. Create and seed the lookup topic LOAD.TEST.LOCAL.LOOKUP.TABLE. The schema uses a key of type string. Three sample records are below; they populate the lookup table with three initial dummy records.
kafka-avro-console-producer --broker-list broker:9092 --topic LOAD.TEST.LOCAL.LOOKUP.TABLE \
--property schema.registry.url=http://schema-registry:8081 \
--property parse.key=true \
--property key.separator=, \
--property key.schema='{"type":"string"}' \
--property value.schema='{"name":"LOAD.TEST.LOCAL.LOOKUP.TABLE","type":"record","namespace":"example.sender.batch","fields":[{"name":"SENDER_CODE","type":"string"},{"name":"SENDER_NAME","type":"string"},{"name":"SENDER_CATEGORY_CODE","type":"string"},{"name":"SENDER_AGENCY_CODE","type":"string"},{"name":"SENDER_SUB_AGENCY_CODE","type":"string"},{"name":"SENDER_FOREIGN_IND","type":"string"},{"name":"SENDER_FOREIGN_COUNTRY","type":"string"},{"name":"SENDER_NAME_ALTERNATE","type":"string"},{"name":"PARENT_SENDER_CODE","type":"string"},{"name":"CHANGE_DATE","type":"string"},{"name":"REQUESTING_LOCATION","type":"string"},{"name":"REQUEST_DATE","type":"string"},{"name":"REPLACEMENT_SENDER_CODE","type":"string"},{"name":"SENDER_STATUS","type":"string"},{"name":"SENDER_DUNS","type":"string"},{"name":"ADDRESSLINE1","type":"string"},{"name":"ADDRESSLINE2","type":"string"},{"name":"ADDRESSLINE3","type":"string"},{"name":"ADDRESS4","type":"string"},{"name":"CITY","type":"string"},{"name":"STATE","type":"string"},{"name":"POSTAL_CODE","type":"string"},{"name":"URL","type":"string"},{"name":"SENDER_ACRONYM","type":"string"},{"name":"DEACTIVATED_DATE","type":"string"},{"name":"Kafka_TimeEvent","type":"string"}]}'
Now paste the three records below. If you press return multiple times and get an exception, just run the same command again and paste the records after pressing return once.
"SVI6FQ",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FR",{"SENDER_CODE":"SVI6FR","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"374 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 973","ADDRESSLINE3":"MAILBOXC","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FN",{"SENDER_CODE":"SVI6FN","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"372 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXA","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
Press Ctrl+C to exit the producer.
Step 5. On a different terminal, open a KSQL CLI
docker run --network cp-all-in-one_default --interactive --tty --rm confluentinc/cp-ksql-cli:latest http://ksql-server:8088
Step 6. Create a KTABLE.
create table load_test_local_lookup_table with (KAFKA_TOPIC='LOAD.TEST.LOCAL.LOOKUP.TABLE',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');
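To sanity-check the table registration, you can optionally run the following (the SELECT only shows the seeded rows after you set auto.offset.reset in Step 7):
ksql> DESCRIBE load_test_local_lookup_table;
ksql> SELECT sender_code, sender_name FROM load_test_local_lookup_table LIMIT 3;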
Step 7. Set the property below so that queries read results from the beginning (earliest) offset. Run this in KSQL.
ksql> SET 'auto.offset.reset'='earliest';
You will see the following message.
Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'
Step 8. Now create the topic where your events will stream, LOAD.TEST.LOCAL.EVENT.STREAM. Use the Schema Registry bash terminal from Steps 3-4, and populate the same three records into this topic.
kafka-avro-console-producer --broker-list broker:9092 --topic LOAD.TEST.LOCAL.EVENT.STREAM \
--property schema.registry.url=http://schema-registry:8081 \
--property parse.key=true \
--property key.separator=, \
--property key.schema='{"type":"string"}' \
--property value.schema='{"name":"LOAD.TEST.LOCAL.EVENT.STREAM","type":"record","namespace":"example.sender.batch","fields":[{"name":"SENDER_CODE","type":"string"},{"name":"SENDER_NAME","type":"string"},{"name":"SENDER_CATEGORY_CODE","type":"string"},{"name":"SENDER_AGENCY_CODE","type":"string"},{"name":"SENDER_SUB_AGENCY_CODE","type":"string"},{"name":"SENDER_FOREIGN_IND","type":"string"},{"name":"SENDER_FOREIGN_COUNTRY","type":"string"},{"name":"SENDER_NAME_ALTERNATE","type":"string"},{"name":"PARENT_SENDER_CODE","type":"string"},{"name":"CHANGE_DATE","type":"string"},{"name":"REQUESTING_LOCATION","type":"string"},{"name":"REQUEST_DATE","type":"string"},{"name":"REPLACEMENT_SENDER_CODE","type":"string"},{"name":"SENDER_STATUS","type":"string"},{"name":"SENDER_DUNS","type":"string"},{"name":"ADDRESSLINE1","type":"string"},{"name":"ADDRESSLINE2","type":"string"},{"name":"ADDRESSLINE3","type":"string"},{"name":"ADDRESS4","type":"string"},{"name":"CITY","type":"string"},{"name":"STATE","type":"string"},{"name":"POSTAL_CODE","type":"string"},{"name":"URL","type":"string"},{"name":"SENDER_ACRONYM","type":"string"},{"name":"DEACTIVATED_DATE","type":"string"},{"name":"Kafka_TimeEvent","type":"string"}]}'
"SVI6FQ",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FR",{"SENDER_CODE":"SVI6FR","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"374 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 973","ADDRESSLINE3":"MAILBOXC","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FN",{"SENDER_CODE":"SVI6FN","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"372 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXA","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
Step 9. Create a stream over the event topic.
create stream load_test_local_event_stream with (KAFKA_TOPIC='LOAD.TEST.LOCAL.EVENT.STREAM',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');
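Optionally verify that the stream sees the three records produced above (assumes auto.offset.reset is still 'earliest'):
ksql> SELECT rowkey, sender_code, sender_name FROM load_test_local_event_stream LIMIT 3;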
Step 10. Derive a stream via a stream-table (inner) join, which detects rows that already exist. We call this the update stream. The topic created for this stream will contain only updates. This is one of my use cases: I have to filter out the messages that are updates.
create stream load_test_update_stream as select event.* from load_test_local_event_stream event join load_test_local_lookup_table lookup on event.sender_code = lookup.sender_code;
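Because the query selects event.*, KSQL prefixes the derived columns with the source alias (EVENT_SENDER_CODE, EVENT_SENDER_NAME, and so on), which is why Step 12 selects those names. You can confirm the exact column names with:
ksql> DESCRIBE load_test_update_stream;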
Step 11. Create a stream on the same topic that backs the lookup table, so that when you want to update the lookup table you can insert into this stream. (If I am not wrong, you cannot insert into a KTABLE directly from a stream, hence this workaround.)
create stream load_test_lookup_feed_stream with (KAFKA_TOPIC='LOAD.TEST.LOCAL.LOOKUP.TABLE',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');
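You can check that this feed stream really sits on the same Kafka topic as the lookup table (DESCRIBE EXTENDED prints the backing topic):
ksql> DESCRIBE EXTENDED load_test_lookup_feed_stream;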
Step 12. Run an INSERT INTO query. It writes into the lookup-table feed stream, which updates the lookup table whenever a message arrives in the update stream.
INSERT INTO load_test_lookup_feed_stream SELECT
  EVENT_SENDER_CODE AS SENDER_CODE,
  EVENT_SENDER_NAME AS SENDER_NAME,
  EVENT_SENDER_CATEGORY_CODE AS SENDER_CATEGORY_CODE,
  EVENT_SENDER_AGENCY_CODE AS SENDER_AGENCY_CODE,
  EVENT_SENDER_SUB_AGENCY_CODE AS SENDER_SUB_AGENCY_CODE,
  EVENT_SENDER_FOREIGN_IND AS SENDER_FOREIGN_IND,
  EVENT_SENDER_FOREIGN_COUNTRY AS SENDER_FOREIGN_COUNTRY,
  EVENT_SENDER_NAME_ALTERNATE AS SENDER_NAME_ALTERNATE,
  EVENT_PARENT_SENDER_CODE AS PARENT_SENDER_CODE,
  EVENT_CHANGE_DATE AS CHANGE_DATE,
  EVENT_REQUESTING_LOCATION AS REQUESTING_LOCATION,
  EVENT_REQUEST_DATE AS REQUEST_DATE,
  EVENT_REPLACEMENT_SENDER_CODE AS REPLACEMENT_SENDER_CODE,
  EVENT_SENDER_STATUS AS SENDER_STATUS,
  EVENT_SENDER_DUNS AS SENDER_DUNS,
  EVENT_ADDRESSLINE1 AS ADDRESSLINE1,
  EVENT_ADDRESSLINE2 AS ADDRESSLINE2,
  EVENT_ADDRESSLINE3 AS ADDRESSLINE3,
  EVENT_ADDRESS4 AS ADDRESS4,
  EVENT_CITY AS CITY,
  EVENT_STATE AS STATE,
  EVENT_POSTAL_CODE AS POSTAL_CODE,
  EVENT_URL AS URL,
  EVENT_SENDER_ACRONYM AS SENDER_ACRONYM,
  EVENT_DEACTIVATED_DATE AS DEACTIVATED_DATE,
  EVENT_KAFKA_TIMEEVENT AS KAFKA_TIMEEVENT
FROM load_test_update_stream PARTITION BY SENDER_CODE;
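To watch what actually lands on the lookup topic as this query runs, you can print the topic from the KSQL CLI (purely a diagnostic):
ksql> PRINT 'LOAD.TEST.LOCAL.LOOKUP.TABLE' FROM BEGINNING;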
:exclamation: Issue: Though this updates my lookup table, the row arrives as a new record, not as an update. To replicate this issue, follow Step 15A.
Step 13. Very similar to the update stream, create a stream that detects new records in the event stream.
create stream load_test_insert_stream as select event.* from load_test_local_event_stream event left join load_test_local_lookup_table lookup on event.sender_code = lookup.sender_code where lookup.sender_code is null;
Verification: You can optionally run just the SELECT part of the query to see what is happening (shown below). If your Schema Registry bash terminal is open, insert a new record with your own key (as in Step 15A). The new message will show up in this stream.
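The standalone SELECT (the body of the query above, runnable on its own):
ksql> SELECT event.* FROM load_test_local_event_stream event LEFT JOIN load_test_local_lookup_table lookup ON event.sender_code = lookup.sender_code WHERE lookup.sender_code IS NULL;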
Step 14. Create an INSERT INTO query like before. This inserts back into the lookup table, so the lookup table is now populated with the new message.
INSERT INTO load_test_lookup_feed_stream SELECT
  EVENT_SENDER_CODE AS SENDER_CODE,
  EVENT_SENDER_NAME AS SENDER_NAME,
  EVENT_SENDER_CATEGORY_CODE AS SENDER_CATEGORY_CODE,
  EVENT_SENDER_AGENCY_CODE AS SENDER_AGENCY_CODE,
  EVENT_SENDER_SUB_AGENCY_CODE AS SENDER_SUB_AGENCY_CODE,
  EVENT_SENDER_FOREIGN_IND AS SENDER_FOREIGN_IND,
  EVENT_SENDER_FOREIGN_COUNTRY AS SENDER_FOREIGN_COUNTRY,
  EVENT_SENDER_NAME_ALTERNATE AS SENDER_NAME_ALTERNATE,
  EVENT_PARENT_SENDER_CODE AS PARENT_SENDER_CODE,
  EVENT_CHANGE_DATE AS CHANGE_DATE,
  EVENT_REQUESTING_LOCATION AS REQUESTING_LOCATION,
  EVENT_REQUEST_DATE AS REQUEST_DATE,
  EVENT_REPLACEMENT_SENDER_CODE AS REPLACEMENT_SENDER_CODE,
  EVENT_SENDER_STATUS AS SENDER_STATUS,
  EVENT_SENDER_DUNS AS SENDER_DUNS,
  EVENT_ADDRESSLINE1 AS ADDRESSLINE1,
  EVENT_ADDRESSLINE2 AS ADDRESSLINE2,
  EVENT_ADDRESSLINE3 AS ADDRESSLINE3,
  EVENT_ADDRESS4 AS ADDRESS4,
  EVENT_CITY AS CITY,
  EVENT_STATE AS STATE,
  EVENT_POSTAL_CODE AS POSTAL_CODE,
  EVENT_URL AS URL,
  EVENT_SENDER_ACRONYM AS SENDER_ACRONYM,
  EVENT_DEACTIVATED_DATE AS DEACTIVATED_DATE,
  EVENT_KAFKA_TIMEEVENT AS KAFKA_TIMEEVENT
FROM load_test_insert_stream PARTITION BY SENDER_CODE;
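At this point both INSERT INTO queries should be listed as running:
ksql> SHOW QUERIES;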
Step 15. What the issue is and how to replicate it.
Step 15A. How to insert a new sample record
Run the producer command from Step 8 (with the schema) and paste a new record like the one below. Notice that I have changed both the message key and the sender code; the message key and the row key should always match.
For example: "SVI6FW",{"SENDER_CODE":"SVI6FW", ...}
"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
Step 15B. How to update a sample record
Very similar to the new-record insertion in 15A, but use the same message key and just change a value. For example, 'SENDER SAM II' becomes 'SENDER SAM III':
"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM III","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
Issue: As you can see, my lookup table is not getting updated. It treats each message as a new message even when the message is sent with the same key. Because of this, I am not able to detect updates; every message is routed as a new message.
You can test this with the following actions (see the diagnostic sketch after this list).
- Send a new message with your own key (15A). It will show up in load_test_insert_stream.
- Send an updated message with the same key (15B). It should show up in load_test_update_stream, but it goes to load_test_insert_stream instead, and the lookup table treats it as a new message.
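One way to narrow this down is to compare the message key (ROWKEY) with SENDER_CODE as the lookup table actually sees them. A guess at the culprit, not a confirmed diagnosis: kafka-avro-console-producer with key.schema writes Avro-encoded keys, while KSQL's PARTITION BY writes plain string keys, so the same logical key can arrive as two different byte sequences that the table treats as distinct rows. A diagnostic sketch:
ksql> PRINT 'LOAD.TEST.LOCAL.LOOKUP.TABLE' FROM BEGINNING;
ksql> SELECT rowkey, sender_code FROM load_test_local_lookup_table;
If the printed keys or ROWKEY values differ in format between the console-produced rows and the rows written back by the INSERT INTO queries, key encoding is the place to dig.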
Any new approaches or suggestions are welcome!