
Some data needs to be shared among all the map() functions, so I can't generate it in setup(), because each setup() belongs to a single mapper. What I want instead is to pre-produce the data once, store it somewhere accessible, and then use it in each map().

How can I do this? Say I am implementing KNN with MapReduce and want every map() call to use the full test data set. Where should I store the test data, and how do I use it in the mapper?

Thank you so much.

xxx222
  • I think this is what you are looking for: http://stackoverflow.com/questions/21239722/hadoop-distributedcache-is-deprecated-what-is-the-preferred-api – vefthym Feb 21 '16 at 05:35

1 Answer


You can store your pre-calculated data in HDFS and then add it to the job's DistributedCache.

https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/filecache/DistributedCache.html

All files (data, libraries, etc.) in the DistributedCache are copied to each node before any task of the job starts on it.

The DistributedCache is not limited to files in HDFS, but the data must be reachable from every node that needs it (as HDFS data is).
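
For example, here is a minimal sketch of a driver and mapper using Job#addCacheFile, the non-deprecated replacement for DistributedCache that the comment above points to. The HDFS path /knn/testdata.txt, the link name testdata, and the class names are assumptions for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KnnDriver {

        public static class KnnMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            private final List<String> testData = new ArrayList<>();

            @Override
            protected void setup(Context context)
                    throws IOException, InterruptedException {
                // The cached file was copied to this node before the task
                // started and symlinked as "testdata" in the working
                // directory, so it can be read like any local file.
                try (BufferedReader reader =
                         new BufferedReader(new FileReader("testdata"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        testData.add(line);
                    }
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // testData is now available to every map() call in this
                // task: compute distances from the training record in
                // `value` to each test point and emit the nearest neighbors.
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "knn");
            job.setJarByClass(KnnDriver.class);
            job.setMapperClass(KnnMapper.class);
            job.setNumReduceTasks(0);          // map-only sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Register the pre-computed test data (hypothetical HDFS path)
            // with the distributed cache; the "#testdata" fragment creates
            // a symlink of that name in each task's working directory.
            job.addCacheFile(new URI("/knn/testdata.txt#testdata"));
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The point of loading the file in setup() rather than map() is that it runs once per mapper task, so each task pays the read cost a single time no matter how many records it processes.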

RojoSam