1

We have a requirement of parsing a very complex json (size around 25 kb per event) event with a predefined schema (nested schema, with multiple schema files ) and create a temporary table and from temp table we have to apply some case statement based some fields( eg. to find out success, failure count , status code ) and do a aggregation in 1 sec interval.

We have tried with inbuilt JSON_VALUE function to retrieve some field value and then apply the case statement, but as I am using JSON_VALUE more than 5/6 times, the application is performing very slow.

For some other filtering use case we are able to receive more that 1600 event/sec, but for this case we are only receiving around 300 event/sec for 1 core .

Below is the query example:

Query 1:

select JSON_VALUE(message, '$.eventRecordHeader.Result' RETURNING INT) AS `result1`, JSON_VALUE(message, '$.eventRecordHeader.Cause.ErrorCode' ) AS errorCode, JSON_VALUE(message, '$.eventRecordHeader.Cause.SubCause' ) AS subCause, JSON_VALUE(message, '$.eventRecordHeader.Cause.SubCause.SubProtocol' ) AS subProtocol, JSON_VALUE(message, '$.eventRecordHeader.Cause.SubCause.SubError' ) AS subError, TO_TIMESTAMP_LTZ(cast(JSON_VALUE(message, '$.eventRecordHeader.StartTime') as bigint)/1000, 3) AS eventTime, proctime() as proctime from kafkaJsonSource;

Query 2:

select count(case when result1=1 then 1 else null end) failed_result,count(case when result1=0 then 1 else null end) successful_result,count(case when errorCode like '4%' then 1 else null end) err_starts_4,count(case when errorCode like '5%' then 1 else null end) err_starts_5,count(case when errorCode like '6%' then 1 else null end) err_starts_6,count(case when subCause is not null then 1 else null end) has_sub_cause,count(case when subProtocol='DNS' then 1 else null end) protocol_dns, count(case when subProtocol='Diameter' then 1 else null end) protocol_diameter, count(case when (subProtocol='Diameter' and subError like '3%') then 1 else null end) protocol_diameter_err_starts_3,count(case when (subProtocol='Diameter' and subError like '4%')  then 1 else null end) protocol_diameter_err_starts_4,count(case when (subProtocol='Diameter' and  subError like '5%')  then 1 else null end) protocol_diameter_err_starts_5 FROM TABLE(TUMBLE(TABLE filter_transformed, DESCRIPTOR(proctime), INTERVAL '1' SECOND)) GROUP BY window_start, window_end;
Soumen C
  • 136
  • 3
  • You could try to define UDF to have more control over JSON parsing https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/functions/udfs/ – bzu Jul 24 '22 at 09:42

0 Answers0