
I am currently building a POC for my company on Confluent Cloud. The current version of ksqlDB does not yet support COLLECT_LIST/COLLECT_SET over multiple columns, a STRUCT, or a MAP, so I am trying to think of a workaround.

I am consuming SQL CDC streams and trying to compose a nested object model with parent-child relationships, without having to build a self-hosted Java UDF or Kafka Streams app.

The streams demo_games, demo_players, and demo_teams should ultimately yield the following model, written to a Kafka topic:

   {
     teamId: bigint,
     teamName: string,
     teamPlayers: [
         {
           playerid: bigint,
           playername:string
         }
     ],
     teamGames: [
         {
           gameid: bigint,
           gamename: string
         }
     ]
   }

Let's start with a smidge of code to illustrate what I am trying to achieve.

CREATE STREAM DEMO_GAMES( GAMEID BIGINT KEY, TEAMID BIGINT, GAMENAME STRING ) 
WITH (KAFKA_TOPIC='DEMO.GAMES',VALUE_FORMAT='JSON', PARTITIONS=1);

INSERT INTO DEMO_GAMES( GAMEID, TEAMID, GAMENAME) VALUES (1,1,'SUNDAY');
INSERT INTO DEMO_GAMES( GAMEID, TEAMID, GAMENAME) VALUES (2,1,'MONDAY');
INSERT INTO DEMO_GAMES( GAMEID, TEAMID, GAMENAME) VALUES (3,1,'FRIDAY');

CREATE STREAM DEMO_PLAYERS( PLAYERID BIGINT KEY, TEAMID BIGINT, PLAYERNAME STRING )
WITH (KAFKA_TOPIC='DEMO.PLAYERS',VALUE_FORMAT='JSON', PARTITIONS=1);

INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (1,1,'PLAYER 1');
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (2,1,'PLAYER 2');
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (3,1,'PLAYER 3');
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (4,1,'PLAYER 4');

CREATE STREAM DEMO_TEAMS( TEAMID BIGINT KEY,TEAMNAME STRING )
WITH (KAFKA_TOPIC='DEMO.TEAMS',VALUE_FORMAT='JSON',  PARTITIONS=1);

INSERT INTO DEMO_TEAMS( TEAMID, TEAMNAME) VALUES (1,'THE TEAM');

-- create a few persistent queries...

CREATE STREAM DEMO_TEAM_PLAYERS AS
  SELECT TEAMID,
         PLAYERID,
         STRUCT(PLAYERID := PLAYERID,
                PLAYERNAME := PLAYERNAME) `model`
  FROM DEMO_PLAYERS EMIT CHANGES;


CREATE STREAM DEMO_TEAM_GAMES AS
  SELECT TEAMID,
         GAMEID,
         STRUCT(GAMEID := GAMEID,
                GAMENAME := GAMENAME) `model`
  FROM DEMO_GAMES EMIT CHANGES;


The two persistent queries above wrap the data I want to pass to COLLECT_LIST into a STRUCT. So now I can execute the following query:

SELECT TEAMID,
       TRANSFORM(COLLECT_LIST(CAST(`model` AS STRING)), t => t) AS TEAMPLAYERS
FROM DEMO_TEAM_GAMES
GROUP BY TEAMID
EMIT CHANGES;

-- yields
{
  "TEAMID": 1,
  "TEAMPLAYERS": [
    "Struct{GAMEID=1,GAMENAME=SUNDAY}",
    "Struct{GAMEID=2,GAMENAME=MONDAY}",
    "Struct{GAMEID=3,GAMENAME=FRIDAY}"
  ]
}
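For context, the remaining step I am after is joining the two aggregates back onto the team. Roughly this shape (a sketch only: the table names TEAM_PLAYERS_AGG, TEAM_GAMES_AGG, and DEMO_TEAMS_TABLE are hypothetical stand-ins for materialized versions of the grouped queries above and a table view of DEMO_TEAMS):

```sql
-- Hypothetical composition step, assuming the grouped results above
-- were materialized as tables keyed by TEAMID.
CREATE TABLE DEMO_TEAM_MODEL AS
  SELECT T.TEAMID,
         T.TEAMNAME,
         P.TEAMPLAYERS,
         G.TEAMGAMES
  FROM DEMO_TEAMS_TABLE T
  JOIN TEAM_PLAYERS_AGG P ON T.TEAMID = P.TEAMID
  JOIN TEAM_GAMES_AGG G ON T.TEAMID = G.TEAMID;
```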

My question is this: is there a way to take a "serialized" STRUCT string and convert it back to a STRUCT within the TRANSFORM lambda?

I also tried building a dynamic JSON string as the parameter to COLLECT_LIST and then rebuilding a struct with EXTRACT_JSON_FIELD. This seems very brittle, and to produce the JSON string I am forced to cast every value to a string.
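For reference, that JSON-string variant looks roughly like this (a sketch, not a recommendation; the CONCAT layout is my own, and note that every extracted field comes back as a STRING):

```sql
-- Build a JSON string per row, aggregate the strings, then pull the
-- fields back out with EXTRACT_JSON_FIELD inside the lambda.
SELECT TEAMID,
       TRANSFORM(
         COLLECT_LIST(
           CONCAT('{"playerid":"', CAST(PLAYERID AS STRING),
                  '","playername":"', PLAYERNAME, '"}')
         ),
         j => STRUCT(PLAYERID := EXTRACT_JSON_FIELD(j, '$.playerid'),
                     PLAYERNAME := EXTRACT_JSON_FIELD(j, '$.playername'))
       ) AS TEAMPLAYERS
FROM DEMO_PLAYERS
GROUP BY TEAMID
EMIT CHANGES;
```

The brittleness is visible here: any quote or brace in a value breaks the hand-rolled JSON, and the numeric PLAYERID is demoted to a string.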


2 Answers


I am also doing a POC for my current company and ran into the same issue you mention here. I think it's a critical feature for ksqlDB. There is a pull request for this feature in the ksql repository that has already been merged; you can wait for the new version release or build the image yourself.

https://github.com/confluentinc/ksql/pull/8877
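Once a release containing that change is available, the workaround should collapse to a direct struct aggregation, something like this (untested against the new version; syntax per the merged PR):

```sql
-- COLLECT_LIST over a STRUCT, which current releases reject.
SELECT TEAMID,
       COLLECT_LIST(STRUCT(PLAYERID := PLAYERID,
                           PLAYERNAME := PLAYERNAME)) AS TEAMPLAYERS
FROM DEMO_PLAYERS
GROUP BY TEAMID
EMIT CHANGES;
```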


I eventually realized that grouping in KSQL is not the same as grouping in T-SQL. Even if you somehow managed to group the relationships, you would end up with ALL related data, not just the most recent, because you can only group on a stream. But I did manage to whip up this gem: you basically force AS_MAP to serialize, replace a few delimiters to make the string parseable, and then pass it off to SPLIT_TO_MAP.

SET 'auto.offset.reset' = 'earliest';

CREATE TABLE EVENTTABLE
WITH (KAFKA_TOPIC='EventsGroupedByTeamTable', VALUE_FORMAT='AVRO', PARTITIONS=1)
AS
SELECT HOMETEAMID,
       ARRAY_LENGTH(COLLECT_SET(HOMETEAMID)) AS COUNT,
       TRANSFORM(
         COLLECT_SET(
           CAST(AS_MAP(
             ARRAY['AWAYTEAMNAME',
                   'HOMETEAMNAME'],
             ARRAY[CAST(AWAYTEAMNAME AS STRING),
                   CAST(HOMETEAMNAME AS STRING)]
           ) AS STRING)
         ),
         P => SPLIT_TO_MAP(REPLACE(REPLACE(P, '{', ''), '}', ''), ',', '=')
       ) AS TEAMSCHEDULES
FROM EVENTSTREAM
GROUP BY HOMETEAMID
EMIT CHANGES;