
I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command

DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.

Then, my setup function looks like this:

public void setup(Context context) throws IOException, InterruptedException{
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    //etc
}

However, this localFiles array is always null.

I was initially running on a single-host cluster for testing, but I read that this prevents the distributed cache from working. I also tried a pseudo-distributed cluster, but that didn't work either.

I'm using Hadoop 1.0.3.

thanks Peter

  • possible duplicate of [Files not put correctly into distributed cache](http://stackoverflow.com/questions/12708947/files-not-put-correctly-into-distributed-cache) – kabuko Jan 21 '13 at 23:15

4 Answers


The problem here was that I was doing the following:

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");

And now it works. Thanks to Harsh on the Hadoop user list for the help.
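For reference, a minimal driver sketch built around this fix might look like the following on Hadoop 1.x. The class name WordCount and the input/output arguments are illustrative, not part of the original question; the point is only that addCacheFile is called before the Job is constructed.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Register the HDFS file with the distributed cache first...
        DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

        // ...and only then construct the Job, which takes its own copy of conf.
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        // set mapper/reducer classes and output key/value types here

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}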

You can also do it this way, adding the cache file to the Job's own copy of the configuration:

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
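On Hadoop 2.x and later, where DistributedCache is deprecated, the same idea is available directly on Job. A rough equivalent, assuming the 2.x API rather than the 1.0.3 version used in the question:

// Job.addCacheFile() writes into the Job's own Configuration,
// so the ordering problem from the accepted answer does not arise.
Job job = Job.getInstance(new Configuration(), "wordcount");
job.addCacheFile(new URI("/user/peter/cacheFile/testCache1"));

In the mapper, context.getCacheFiles() then returns the registered URIs.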

StarScream

Once the Job has been constructed from a configuration object, i.e.

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");

any further changes made to conf, e.g.

conf.set("delimiter", "|");

or

DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

will not be reflected on a pseudo-distributed or fully distributed cluster; they only appear to work in the local environment.
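A quick way to see this copy behaviour for yourself; the property name demo.after.job is purely illustrative:

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");

conf.set("demo.after.job", "x");                                    // set on the original conf
System.out.println(job.getConfiguration().get("demo.after.job"));  // prints null

job.getConfiguration().set("demo.after.job", "x");                 // set on the Job's copy
System.out.println(job.getConfiguration().get("demo.after.job"));  // prints x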

user2458922

This version of the code (which is slightly different from the constructs mentioned above) has always worked for me.

//in main(String[] args)
Job job = new Job(conf, "Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());

I didn't see the complete setup() function in the Mapper code, so here is one:

public void setup(Context context) throws IOException, InterruptedException {

    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);

    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);

    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read data

}
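From there, a rough continuation inside setup() could drain the cached file line by line and close the reader; how each line is parsed depends entirely on your file format:

String line;
while ((line = cacheReader.readLine()) != null) {
    // parse `line` and fill whatever in-memory structure the mapper needs
}
cacheReader.close();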
Somum