
I am new to Kafka and trying to build a pipeline from my Apache httpd logs to MongoDB.

I have data produced by Filebeat with the Kafka output. I then use Kafka Streams to read from the topic, transform the data with mapValues, and stream it out to a different topic. The data is then written to a database (MongoDB) by a Kafka Connect sink. Unfortunately, my data from Filebeat does not come with an ID.

How can I create a unique ID and insert it into each document before sinking it to MongoDB? I am hoping this can happen in the mapValues transformation.

  • What kind of ID do you need? Wouldn't a combination or hashing of hostname/ip & filename+file modtime be enough? – OneCricketeer Feb 07 '19 at 22:48
  • That might be enough actually. I feel like there is a chance it isn't technically always unique, but almost always it is. For my use case and just to get the ball rolling I am going to try this. – Sam Ulloa Feb 08 '19 at 00:54

1 Answer


I think you could use a combination of partition and offset to create a unique ID per message. You might want to add the topic as well if the ID needs to be unique across topics.

  • I could not access the partition or offset or topic inside the KStream object. – Sam Ulloa Feb 08 '19 at 00:54
  • 1
    I will admit that I am not a Kafka Stream expert but maybe this will help - https://stackoverflow.com/questions/40807346/how-can-i-get-the-offset-value-in-kstream – Arne Saupe Feb 08 '19 at 01:03
  • 1
    You would need to use `transform()` instead of `mapValue()` -- the provided `context` object from `init()` method, allows you to access topic, partition, and offset for each input record. – Matthias J. Sax Feb 08 '19 at 01:04