I have two different streams that receive data at different times, and I need to join them on the basis of the timestamp contained in each record's value. The example below illustrates this.

inputStream1 ->

  • key 111, value 21:00 AAA
  • key 111, value 21:02 AAA
  • key 111, value 21:04 AAA
  • key 111, value 21:15 AAA
  • key 111, value 21:18 BBB
  • key 111, value 21:20 BBB

inputStream2 ->

  • key 111, value 21:01 10.0.0.1
  • key 111, value 21:04 10.0.0.2
  • key 111, value 21:14 10.0.0.3
  • key 111, value 21:20 10.0.0.4
  • key 111, value 21:21 10.0.0.5

Join output that I need ->

  • AAA 10.0.0.1
  • AAA 10.0.0.2
  • AAA 10.0.0.3
  • BBB 10.0.0.4
  • BBB 10.0.0.5

Note: The two streams receive their input at different times. It is possible that when the first record arrives on inputStream1, inputStream2 already has all 5 records present. I want to match them on the time window given by the timestamp in the value.

How can I achieve this in Kafka? Is it even possible?

Shashank
  • 11
  • 3
  • Do you have any knowledge about the time difference? Is one stream always ahead of the other? Do you want to join each event to exactly one record of the other stream? -- In general it sounds like a windowed stream-stream join, but I'm not sure if this operator will really fit your use case. – Matthias J. Sax Jan 19 '21 at 00:22

1 Answer

It would be very difficult, if not nearly impossible, with just Kafka. Theoretically, you could have a singleton server that reads from both topics and does the correlation by reading just enough from each topic to hold the records that match each other in memory.
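
As a rough illustration of that in-memory correlation, here is a minimal sketch in plain Python with no Kafka client involved. It assumes the desired output in the question amounts to an "as-of" join: each inputStream2 record is paired with the latest inputStream1 record whose embedded timestamp is at or before its own. That reading is an interpretation of the example, not something the question states, and the names `to_minutes` and `as_of_join` are hypothetical.

```python
from bisect import bisect_right


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def as_of_join(left, right):
    """Pair each right record with the latest left record at or before it.

    Both inputs are lists of (hhmm, payload) tuples for a single key.
    Assumption: all timestamps fall on the same day.
    """
    left_sorted = sorted(left, key=lambda rec: to_minutes(rec[0]))
    left_times = [to_minutes(ts) for ts, _ in left_sorted]
    joined = []
    for ts, payload in sorted(right, key=lambda rec: to_minutes(rec[0])):
        # Index of the last left record with timestamp <= ts, or -1 if none.
        idx = bisect_right(left_times, to_minutes(ts)) - 1
        if idx >= 0:
            joined.append((left_sorted[idx][1], payload))
    return joined


# Records for key 111, copied from the question.
stream1 = [("21:00", "AAA"), ("21:02", "AAA"), ("21:04", "AAA"),
           ("21:15", "AAA"), ("21:18", "BBB"), ("21:20", "BBB")]
stream2 = [("21:01", "10.0.0.1"), ("21:04", "10.0.0.2"), ("21:14", "10.0.0.3"),
           ("21:20", "10.0.0.4"), ("21:21", "10.0.0.5")]

result = as_of_join(stream1, stream2)
for label, ip in result:
    print(label, ip)
```

The sketch sees both streams in full before joining. A real consumer would have to decide how long to buffer inputStream2 records while waiting for inputStream1 to catch up, which is the hard part.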

If you only need to correlate data within given windows of time, some of the Kafka client libraries can read messages by time window, so you may be able to use that. However, from your example data it is not clear that this applies, unless your timestamps are merely the times at which the messages arrive.
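
To make the windowing caveat concrete, the sketch below simulates the semantics of a symmetric time-window join in plain Python. It is not a Kafka client API. It pairs every record from one stream with every record from the other whose embedded timestamps are within a window of each other. The 2-minute window and the name `windowed_join` are arbitrary assumptions for illustration.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def windowed_join(left, right, window_minutes):
    """Pair all left/right records whose timestamps differ by <= window_minutes."""
    pairs = []
    for left_ts, left_val in left:
        for right_ts, right_val in right:
            if abs(to_minutes(left_ts) - to_minutes(right_ts)) <= window_minutes:
                pairs.append((left_val, right_val))
    return pairs


# Records for key 111, copied from the question.
stream1 = [("21:00", "AAA"), ("21:02", "AAA"), ("21:04", "AAA"),
           ("21:15", "AAA"), ("21:18", "BBB"), ("21:20", "BBB")]
stream2 = [("21:01", "10.0.0.1"), ("21:04", "10.0.0.2"), ("21:14", "10.0.0.3"),
           ("21:20", "10.0.0.4"), ("21:21", "10.0.0.5")]

pairs = windowed_join(stream1, stream2, window_minutes=2)
for label, ip in pairs:
    print(label, ip)
```

With this data and window, some pairs are emitted more than once because several inputStream1 records fall inside the same window. A plain windowed join therefore does not reproduce the five-row output in the question without further deduplication. In Kafka Streams, window joins operate on record timestamps, which can be derived from the payload with a custom `TimestampExtractor`.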

However, this would all fall apart as soon as you needed to scale to a second instance, unless both topics were partitioned the same way (same key and same number of partitions), so that matching records end up on the same instance.

Alternatively, you could write the data from the topics to an intermediate data store and do lookups against that data.
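
A minimal sketch of the intermediate-store idea, using an in-memory SQLite database as a stand-in for whatever store you would actually use. The table layout and the helper name `lookup_label` are hypothetical. It assumes the inputStream1 records have already been stored before the lookups run, and that the zero-padded `HH:MM` timestamps all fall on the same day so they compare correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream1 (key TEXT, ts TEXT, label TEXT)")

# Records from inputStream1 in the question, as they might be persisted.
conn.executemany(
    "INSERT INTO stream1 VALUES (?, ?, ?)",
    [("111", "21:00", "AAA"), ("111", "21:02", "AAA"), ("111", "21:04", "AAA"),
     ("111", "21:15", "AAA"), ("111", "21:18", "BBB"), ("111", "21:20", "BBB")],
)


def lookup_label(conn, key, ts):
    """Return the label of the latest stream1 row for `key` at or before `ts`."""
    row = conn.execute(
        "SELECT label FROM stream1 WHERE key = ? AND ts <= ? "
        "ORDER BY ts DESC LIMIT 1",
        (key, ts),
    ).fetchone()
    return row[0] if row else None


# Records from inputStream2 in the question, looked up one at a time.
stream2 = [("111", "21:01", "10.0.0.1"), ("111", "21:04", "10.0.0.2"),
           ("111", "21:14", "10.0.0.3"), ("111", "21:20", "10.0.0.4"),
           ("111", "21:21", "10.0.0.5")]

joined = [(lookup_label(conn, key, ts), ip) for key, ts, ip in stream2]
for label, ip in joined:
    print(label, ip)
```

If an inputStream2 record is looked up before the matching inputStream1 record has been stored, the query returns an older label or nothing at all, so you would still need some delay or retry policy.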

Ken Rabe
  • 149
  • 3