
I'm designing the next generation of an analysis system that needs to process many events from many sensors in near real time. To do that, I want to use one of the big data analytics platforms such as Hadoop, Spark Streaming, or Flink.

In order to analyze each event I need to use some metadata from a database table, or at least load it into a cached map.

The problem is that each mapper will be parallelized across several nodes.

So I have two things to handle:

  • First, how do I load/pass a HashMap to each mapper?
  • Is there any way to keep the HashMap consistent across the mappers?
Gal Dreiman
    DistributedCache is what you're looking for - http://stackoverflow.com/questions/21239722/hadoop-distributedcache-is-deprecated-what-is-the-preferred-api. You can make metadata files available to all Mappers. – Ben Watson Feb 16 '17 at 11:09
  • You could also build the hashmap in the `setup()` method of each mapper by querying your DB perhaps. – Binary Nerd Feb 16 '17 at 12:40
  • Will you be using Hadoop or Spark? The solution can be quite different. Also, do you need every mapper to have the same info, or is it specific to each mapper? – A.Perrot Feb 16 '17 at 15:07

1 Answer


Serialize the HashMap to a file, store it in HDFS, and in the MapReduce job configuration phase use the DistributedCache to distribute the file with the serialized HashMap to all the mappers. Then, in the map phase, each mapper can read the file, deserialize it, and access the HashMap.
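A minimal sketch of this approach, assuming the metadata lives in an HDFS file of tab-separated key/value lines called `metadata.txt` (the file name, paths, and field layout here are illustrative assumptions, not from the question). It uses the newer `Job.addCacheFile()` API mentioned in the comments in place of the deprecated `DistributedCache` class:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MetadataJoinExample {

    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> metadata = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Files registered with job.addCacheFile() are localized on every node;
            // the "#metadata.txt" fragment below makes it available under this local name.
            try (BufferedReader reader = new BufferedReader(new FileReader("metadata.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        metadata.put(parts[0], parts[1]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Illustrative assumption: the first comma-separated field of each event
            // line is the sensor id used to look up its metadata.
            String sensorId = value.toString().split(",", 2)[0];
            String meta = metadata.getOrDefault(sensorId, "UNKNOWN");
            context.write(new Text(sensorId), new Text(meta));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "metadata-join-example");
        job.setJarByClass(MetadataJoinExample.class);
        job.setMapperClass(EventMapper.class);
        // Register the HDFS metadata file with the distributed cache.
        job.addCacheFile(new URI("/meta/metadata.txt#metadata.txt"));
        // ... input/output paths and the rest of the job setup omitted ...
    }
}
```

Because every mapper builds the same read-only map in `setup()` from the same cached file, the lookups stay consistent across nodes without any coordination between tasks.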

Denis