2

I have a process that feeds relatively simple vehicle data into a kafka topic. The records are keyd by registration and the values contain things like latitude/longitude etc + a value called DateTime which is a timestamp based on the sensor that took the readings (not the producer or the cluster).

My data arrives out of order in general and also especially if I keep on pumping the same test data set into the vehicle-update-log topic over and over. My data set contains two records for the vehicle I'm testing with.

My expectation is that when I do a select on the table, that it will return one row with the most recent data based on the ROWTIME of the records. (I've verified that the ROWTIME is getting set correctly.)

What happens instead is that the result has both rows (for the same primary KEY) and the last value is the oldest ROWTIME.

I'm confused; I thought ksql will keep the most recent update only. Must I now write additional logic on the client side to pick the latest of the data I get?

I created the table like this:

CREATE TABLE vehicle_updates 
(
Latitude DOUBLE, 
Longitude DOUBLE,
DateTime BIGINT, 
Registration STRING PRIMARY KEY
)
WITH 
(
KAFKA_TOPIC = 'vehicle-update-log', 
VALUE_FORMAT = 'JSON_SR', 
TIMESTAMP = 'DateTime'
);

Here is my query:

SELECT 
registration,
ROWTIME, 
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS', 'Africa/Johannesburg') AS rowtime_formatted  
FROM vehicle_updates 
WHERE registration = 'BT66MVE'
EMIT CHANGES;

Results while no data is flowing:

+------------------------------+------------------------------+------------------------------+
|REGISTRATION                  |ROWTIME                       |ROWTIME_FORMATTED             |
+------------------------------+------------------------------+------------------------------+
|BT66MVE                       |1631532052000                 |2021-09-13 13:20:52.000       |
|BT66MVE                       |1631527147000                 |2021-09-13 11:59:07.000       |

Here's the same query, but I'm pumping the data set into the topic again while the query is running. I'm surprised to be getting the older record as updates.

Results while feeding data:

+------------------------------+------------------------------+------------------------------+
|REGISTRATION                  |ROWTIME                       |ROWTIME_FORMATTED             |
+------------------------------+------------------------------+------------------------------+
|BT66MVE                       |1631532052000                 |2021-09-13 13:20:52.000       |
|BT66MVE                       |1631527147000                 |2021-09-13 11:59:07.000       |
|BT66MVE                       |1631527147000                 |2021-09-13 11:59:07.000       |

What gives?

Pieter Breed
  • 5,579
  • 5
  • 44
  • 60

1 Answers1

3

In the end, it's a issue in Kafka Streams, that is not easy to resolve: https://issues.apache.org/jira/browse/KAFKA-10493 (we are working on some long term solution for it already though).

While event-time based processing is a central design pillar, there are some gaps that still needs to get closed.

The underlying issue is, that Kafka itself was originally designed based on log-append order only. Timestamps got added later (in 0.10 release). There are still some gaps today (eg, https://issues.apache.org/jira/browse/KAFKA-7061) in which "offset order" is dominant. You are hitting one of those cases.

Matthias J. Sax
  • 59,682
  • 7
  • 117
  • 137