
I am trying to create a stream in ksqlDB that reads data from a Kafka topic so that I can query it.

CREATE STREAM test_location (
  id VARCHAR,
  name VARCHAR,
  location VARCHAR
  )

 WITH (KAFKA_TOPIC='public.location',
       VALUE_FORMAT='JSON',
       PARTITIONS=10);

The data in the topic public.location is in JSON format.

Updated: here is a message from the topic.

print 'public.location' from beginning limit 1;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2021/05/23 11:27:39.429 Z, key: <null>, value: {"sourceTable":{"id":"1","name":Sam,"location":Manchester,"ConnectorVersion":null,"connectorId":null,"ConnectorName":null,"DbName":null,"DbSchema":null,"TableName":null,"payload":null,"schema":null},"ConnectorVersion":null,"connectorId":null,"ConnectorName":null,"DbName":null,"DbSchema":null,"TableName":null,"payload":null,"schema":null}, partition: 3

After creating the stream and running a SELECT on it, I get NULL in the output, even though the topic has data.

select * from test_location
>EMIT CHANGES limit 5;
+-----------------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|ID                                                               |NAME                                                             |LOCATION                                                         |
+-----------------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|null                                                             |null                                                             |null                                                             |
|null                                                             |null                                                             |null                                                             |
|null                                                             |null                                                             |null                                                             |
|null                                                             |null                                                             |null                                                             |
|null                                                             |null                                                             |null                                                             |
Limit Reached
Query terminated

Here are the details from the Docker Compose file:

version: '2'

services:

  ksqldb-server:
    image: confluentinc/ksqldb-server:0.18.0
    hostname: ksqldb-server
    container_name: ksqldb-server
    depends_on:
      - schema-registry
    ports:
      - "8088:8088"
    environment:
      KSQL_LISTENERS: "http://0.0.0.0:8088"
      KSQL_BOOTSTRAP_SERVERS: "broker:29092"
      KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"
      KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"
      # Configuration to embed Kafka Connect support.
      KSQL_CONNECT_GROUP_ID: "ksql-connect-01"
      KSQL_CONNECT_BOOTSTRAP_SERVERS: "broker:29092"
      KSQL_CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      KSQL_CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      KSQL_CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      KSQL_CONNECT_CONFIG_STORAGE_TOPIC: "_ksql-connect-01-configs"
      KSQL_CONNECT_OFFSET_STORAGE_TOPIC: "_ksql-connect-01-offsets"
      KSQL_CONNECT_STATUS_STORAGE_TOPIC: "_ksql-connect-01-statuses"
      KSQL_CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      KSQL_CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      KSQL_CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      KSQL_CONNECT_PLUGIN_PATH: "/usr/share/kafka/plugins"

Update: here is a message from the topic as I see it in Kafka:

{
   "sourceTable": {
      "id": "1",
      "name": Sam,
      "location": Manchester,
      "ConnectorVersion": null,
      "connectorId": null,
      "ConnectorName": null,
      "DbName": null,
      "DbSchema": null,
      "TableName": null,
      "payload": null,
      "schema": null
   },
   "ConnectorVersion": null,
   "connectorId": null,
   "ConnectorName": null,
   "DbName": null,
   "DbSchema": null,
   "TableName": null,
   "payload": null,
   "schema": null
}

Which step or configuration am I missing?

DebianUser
  • Please show your actual topic data – OneCricketeer Jul 29 '21 at 17:28
  • @OneCricketeer Hello, I have updated the question with the topic message from ksqlDB and also added how it looks in the Kafka topic... the nulls in the data are actually null fields... – DebianUser Aug 02 '21 at 08:41
  • This data came from Debezium? Your schema+payload are completely null, so there are no id, name, location fields to obtain. Besides, in order to parse this data, you'd have needed to define a `payload` field in ksql or tell Debezium to extract the payload from the event – OneCricketeer Aug 02 '21 at 14:35
  • @OneCricketeer This data comes from an SCDF processor into Kafka. It is basically an ETL that also contains Debezium at the start as the source connector, but this specific topic is created by processing the data in SCDF. – DebianUser Aug 03 '21 at 07:55
  • Okay, well, my comment is still valid. As you've shown, your message has no `id,name,location` at all, so you should expect to only see nulls in KSQL. And if you want to be able to parse that data, you need to set KSQL to have at least a `payload STRUCT` field – OneCricketeer Aug 03 '21 at 17:54

1 Answer


Given your payload, you need to declare the schema as nested, because id, name, and location are not top-level fields in the JSON; they are nested within sourceTable.

CREATE STREAM est_location (
  sourceTable STRUCT<id VARCHAR, name VARCHAR, location VARCHAR>
) WITH (KAFKA_TOPIC='public.location',
        VALUE_FORMAT='JSON');

It's not possible to "unwrap" the data when defining the schema; the schema must match what is actually in the topic. In addition to sourceTable, you could also add ConnectorVersion etc. to the schema, as they are also top-level fields in your JSON. The bottom line is that columns in ksqlDB can only be declared for top-level fields; everything else is nested data that you access via the STRUCT type.
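The effect can be sketched outside ksqlDB with plain JSON parsing. This is a hypothetical Python illustration, not ksqlDB's actual implementation; the sample message mirrors the one in the question, with string values quoted so it parses as valid JSON:

```python
import json

# Sample mirroring the question's message (values quoted to be valid JSON).
message = """
{
  "sourceTable": {"id": "1", "name": "Sam", "location": "Manchester"},
  "ConnectorVersion": null,
  "payload": null
}
"""

record = json.loads(message)

# Declaring id/name/location as top-level columns is analogous to this:
# the keys are simply not present at the top level, so every row is null.
print(record.get("id"))    # None -> the NULLs seen in the SELECT output
print(record.get("name"))  # None

# The values live one level down, under sourceTable:
print(record["sourceTable"]["id"])        # 1
print(record["sourceTable"]["location"])  # Manchester
```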

Later, when you query est_location, you can refer to individual fields via sourceTable->id etc.

It would also be possible to declare a derived STREAM if you want to unnest the schema:

CREATE STREAM unnested_est_location AS
  SELECT sourceTable->id AS id,
         sourceTable->name AS name,
         sourceTable->location AS location
  FROM est_location;

Of course, this would write the data into a new topic.
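Once the derived stream exists, its columns are top-level and can be queried directly; a minimal sketch (the LIMIT is only there to bound the example):

```sql
-- Columns are now flat; no STRUCT dereference needed.
SELECT id, name, location
FROM unnested_est_location
EMIT CHANGES
LIMIT 5;
```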

Matthias J. Sax