Distributed Cache in Hadoop

Question

What is the Distributed Cache in Hadoop?

How does it work?

Could someone give me a clear description of it with a real-world example?

How many questions are you going to keep asking about Hadoop without doing some research? Read Hadoop: The Definitive Guide or something. — Balduz, Jul 31 '14 at 06:59
This might help: https://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata — vefthym, Jul 31 '14 at 12:37

score 0 · Answer 1 · answered Jul 31 '14 at 08:42

The distributed cache can hold small data files needed for initialization, or libraries of code, that must be accessible on every node in the cluster. For example, say you have to count word occurrences across a huge set of files, and you have been told to count every word except those listed in a given file (say ignore.csv, which is also a large file).

You then read ignore.csv from the distributed cache in the setup function of your mapper or reducer, depending on your logic, and store its contents in a data structure where each word can be looked up easily (e.g. a HashMap).

The file is read and stored before the mapper or reducer on any machine starts, and the distributed cache is the same for all machines running in the cluster.
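To make the setup-then-lookup pattern concrete, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes: `IgnoreListWordCount`, its `setup`, `map`, and `run` methods, and the temporary file standing in for ignore.csv are all hypothetical names chosen for illustration. In a real job the same logic would sit in the mapper's setup and map methods, and the file would come from the distributed cache rather than a temp file. A HashSet is used instead of a HashMap because only membership checks are needed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for a word-count mapper; no Hadoop classes are used.
class IgnoreListWordCount {
    private final Set<String> ignored = new HashSet<>();
    private final Map<String, Integer> counts = new HashMap<>();

    // Plays the role of the mapper's setup step: runs once, before any record.
    void setup(Path localIgnoreFile) throws IOException {
        for (String line : Files.readAllLines(localIgnoreFile)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                ignored.add(word);
            }
        }
    }

    // Plays the role of the mapper's map step: runs once per input line.
    void map(String line) {
        for (String token : line.toLowerCase().split("\\s+")) {
            if (token.isEmpty() || ignored.contains(token)) {
                continue; // skip blanks and ignored words
            }
            counts.merge(token, 1, Integer::sum);
        }
    }

    // Hypothetical driver: writes the ignore list to a temp file (standing in
    // for the locally cached ignore.csv), then feeds every line through map().
    static Map<String, Integer> run(List<String> ignoreWords, List<String> lines) {
        try {
            Path ignoreFile = Files.createTempFile("ignore", ".csv");
            Files.write(ignoreFile, ignoreWords);
            IgnoreListWordCount task = new IgnoreListWordCount();
            task.setup(ignoreFile);
            for (String line : lines) {
                task.map(line);
            }
            Files.delete(ignoreFile);
            return task.counts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> result = run(
                Arrays.asList("the", "a"),
                Arrays.asList("the cat saw a cat", "A dog"));
        System.out.println(result);
    }
}
```

The point of loading the ignore list once in the setup step is that every later lookup is an in-memory hash check instead of a file read per record.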

I hope this is clear now. Please comment if you have any doubts.

score 0 · Answer 2 · edited May 23 '17 at 10:33

DistributedCache is a deprecated class in Hadoop. Here is the right way to use it:

Hadoop DistributedCache is deprecated - what is the preferred API?

DistributedCache copies the files to all the slave nodes so that access is faster for the MR job tasks running locally. The cache is not in RAM; it is just a file system cache on the local disk volumes of all the slave nodes.
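The "copy to every node's local disk" idea can be illustrated without a cluster. The sketch below is plain Java, not Hadoop code: `LocalizeSketch`, `writeShared`, and `localize` are hypothetical names, and temporary directories stand in for the local disks of the slave nodes. As far as I know, in a real job the non-deprecated route is `Job.addCacheFile(URI)` in the driver and `context.getCacheFiles()` inside the task; those calls are left out here so the example runs with only the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: temp directories play the role of each node's local disk.
class LocalizeSketch {

    // Writes the shared file once (stands in for a file stored in HDFS).
    static Path writeShared(List<String> lines) {
        try {
            Path shared = Files.createTempFile("ignore", ".csv");
            Files.write(shared, lines);
            return shared;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Copies the shared file into one directory per simulated node and
    // returns the paths of the node-local copies.
    static List<Path> localize(Path shared, int nodes) {
        try {
            List<Path> copies = new ArrayList<>();
            for (int i = 0; i < nodes; i++) {
                Path nodeDir = Files.createTempDirectory("node" + i + "-");
                Path local = nodeDir.resolve(shared.getFileName().toString());
                Files.copy(shared, local, StandardCopyOption.REPLACE_EXISTING);
                copies.add(local);
            }
            return copies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path shared = writeShared(Arrays.asList("the", "a"));
        // Each "task" reads its own on-disk copy; nothing is held in RAM
        // until the task itself loads the file.
        for (Path local : localize(shared, 3)) {
            System.out.println(local + " -> " + Files.readAllLines(local));
        }
    }
}
```

Each simulated node ends up with its own file on disk holding identical contents, which mirrors the point above: the cache is a set of local file copies, and anything held in memory is whatever the task chooses to load from them.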

answered Aug 01 '14 at 11:23 · Prabakaran
