
I am new to Kafka and trying to build a pipeline from my Apache httpd logs to MongoDB.

I have data produced by Filebeat with the Kafka output. I then use Kafka Streams to read from the topic, transform the data with mapValues, and stream it out to a different topic. The data is then written to a database (MongoDB) by a Kafka Connect sink. Unfortunately, my data from Filebeat does not come with an ID.

How can I create a unique ID and insert it into each document before sinking it to MongoDB? I am hoping this can happen in the mapValues transformation.

  • What kind of ID do you need? Wouldn't a combination or hashing of hostname/ip & filename+file modtime be enough? – OneCricketeer Feb 07 '19 at 22:48
  • That might be enough actually. I feel like there is a chance it isn't technically always unique, but almost always it is. For my use case and just to get the ball rolling I am going to try this. – Sam Ulloa Feb 08 '19 at 00:54

1 Answer


I think you could use a combination of partition and offset to create a unique ID per message. You might want to add the topic as well if the ID needs to be unique across topics.

  • I could not access the partition or offset or topic inside the KStream object. – Sam Ulloa Feb 08 '19 at 00:54
  • 1
    I will admit that I am not a Kafka Stream expert but maybe this will help - https://stackoverflow.com/questions/40807346/how-can-i-get-the-offset-value-in-kstream – Arne Saupe Feb 08 '19 at 01:03
  • 1
    You would need to use `transform()` instead of `mapValue()` -- the provided `context` object from `init()` method, allows you to access topic, partition, and offset for each input record. – Matthias J. Sax Feb 08 '19 at 01:04