
I have two files. I want one file to be in the Distributed Cache and the other to be sent to the mapper.

But the file that should go into the Distributed Cache is very large. What I am planning is to divide that file in one mapper and then send its blocks in parallel to another map process.

Any idea how to implement this?

Pooja3101

1 Answer


First of all, the reason the Distributed Cache exists is to give all the mappers (read) access to one or more common files, e.g. a list of stopwords. If you don't need that, then you don't need the Distributed Cache. Furthermore, if the two files you describe have the same format and you handle them in the same way, then just pass their root directory (or directories) as input to your mapper. Hadoop will handle both of them the same way and split both of them. If that is not the case, then continue reading my answer.
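
As a quick illustration, this is roughly how the Distributed Cache is normally used (a minimal sketch with the Hadoop 2 MapReduce API; the HDFS path and class names are only placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    public static class StopwordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Set<String> stopwords = new HashSet<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // The "#stopwords" fragment used in addCacheFile() below makes Hadoop create a
            // symlink named "stopwords" in the task's working directory, so each mapper
            // can read the shared file locally.
            BufferedReader reader = new BufferedReader(new FileReader("stopwords"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    stopwords.add(line.trim());
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... use 'stopwords' while processing each input record ...
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache example");
        job.setJarByClass(CacheExample.class);
        job.setMapperClass(StopwordMapper.class);
        // Placeholder HDFS path for the shared file:
        job.addCacheFile(new URI("/shared/stopwords.txt#stopwords"));
        // ... set input/output paths and the rest of the job as usual ...
    }
}
```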

If you want to use the output of the first mapper as the (single) input of the second mapper, then you can use a ChainMapper.
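
A rough sketch of what that could look like with the new-API ChainMapper (the two mapper classes and key/value types are just placeholders; the output types of the first mapper must match the input types of the second):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainExample {

    // First mapper in the chain: reads the raw input.
    public static class FirstMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("k"), value);  // placeholder logic
        }
    }

    // Second mapper in the chain: consumes the first mapper's output.
    public static class SecondMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);  // placeholder logic
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chained mappers");
        job.setJarByClass(ChainExample.class);

        ChainMapper.addMapper(job, FirstMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                new Configuration(false));
        ChainMapper.addMapper(job, SecondMapper.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        // ... set input/output paths, reducer (if any), and submit as usual ...
    }
}
```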

But I guess that you also want to use the second input file, so you can split your job into a chain of two jobs. Then the input of the second job's mapper can be a combination of both input files: the output of the first job and a file, as long as they are in the same format. You can use the addInputPath method for this purpose.
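
Something along these lines (all paths are placeholders, and most job configuration details are omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoJobChain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder paths:
        Path firstInput  = new Path("/input/big-file");
        Path firstOutput = new Path("/tmp/job1-output");
        Path secondInput = new Path("/input/other-file");
        Path finalOutput = new Path("/output/final");

        // Job 1: preprocess the large file (map-only, or with an identity reducer).
        Job job1 = Job.getInstance(conf, "job 1");
        FileInputFormat.addInputPath(job1, firstInput);
        FileOutputFormat.setOutputPath(job1, firstOutput);
        // ... set mapper (and reducer, if needed) for job1 ...
        job1.waitForCompletion(true);

        // Job 2: both paths feed the same mapper, as long as the files share a format.
        Job job2 = Job.getInstance(conf, "job 2");
        FileInputFormat.addInputPath(job2, firstOutput);
        FileInputFormat.addInputPath(job2, secondInput);
        FileOutputFormat.setOutputPath(job2, finalOutput);
        // ... set mapper/reducer for job2 ...
        job2.waitForCompletion(true);
    }
}
```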

Otherwise, you can get your file directly from the filesystem, as described here.
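
For example, a mapper could load the file (or one part-file of it) in its setup() method, roughly like this (the HDFS path is a placeholder):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that loads a dictionary (or one block of it) straight from HDFS in setup().
public class SpellCheckMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> dictionary = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path dictPath = new Path("/shared/dictionary/part-00000");  // placeholder path
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(dictPath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                dictionary.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... check each word of 'value' against 'dictionary' ...
    }
}
```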

Note that if your large file is larger than a block's size (64 MB by default), and it is splittable, Hadoop splits it automatically when it is given as input to a mapper.

vefthym
  • I want to use one complete file as input to all the mappers, so I am using the distributed cache. At the same time, my problem is that the file is too large to fit in the cache at once. So I want to divide the file and pass it block by block as the distributed cache. – Pooja3101 Feb 22 '14 at 17:20
  • A simple example is a spell-check program. I have a file to be checked for spelling errors, and I have a dictionary file to be used as the dist cache. But the dictionary file is very large, so I want to divide it into 10 subfiles. Suppose I have a map phase in which 4 mappers are running, each with one block of the spell-check file. What I want is to send all the dictionary part-files to each mapper one by one. – Pooja3101 Feb 22 '14 at 17:22
  • Then use the last option from my suggestions and get each block directly from the filesystem. Splitting a file into more files does not need to be done in Hadoop. – vefthym Feb 22 '14 at 18:16
  • What if the file to be used as dist cache needs to be first processed in mapper and then to be used as cache in next map phase? – Pooja3101 Feb 22 '14 at 19:07
  • The easiest way to do it is by using two jobs: one for processing this file without a reducer (or with an identity reducer, if you need it sorted), and another one for the spell-checking program. – vefthym Feb 22 '14 at 19:16