2

I am reading data from 2 kafka topics. Which can be described as:

Topic1 data content: VehicleRegistrationNo, Timestamp, Location Topic2 data content: VehicleRegistrationNo, Timestamp, Speed

I need to merge these 2 messages based on nearest timestamp in both and output tuple as message VehicleRegistrationNo, Timestamp, Speed, Location. I am reading these topic via 2 spouts S1 and S2. Then bolt MergeS1andS2 takes input from these spouts and works as:

if (message from S1): save present message from S1 along with 2 previous messages (3 consecutive locations) to LocationHashMap elseif (message from S2): get locations details from LocationHashmap and merge speed for same Vehicle with location info, then send details to next bolt as tuple

I know HashMap is not efficient way of storing data in multinode. So I read about Trident and Redis to store intermediate data. What should I use to store my intermediate data in this senario which can work in distributed topology.

Swati
  • 535
  • 11
  • 25

1 Answers1

0

Any no-sql database will do the trick. Pick a key that uniquely identifies the tuple regardless from which topic is coming from. The logic would be something like:

  • Try to lookup the tuple from the database.
  • If tuple does not exist in database, store the tuple that you got from the topic into the database.
  • If tuple exists, merge contents of database tuple and topic tuple and store the resulting tuple back into the database (overwriting the contents of the previous tuple in the database)
David Espart
  • 11,520
  • 7
  • 36
  • 50