37

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.

The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:

// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf); 

// In the mapper
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...

However, DistributedCache is marked as deprecated in Hadoop 2.2.0.

What is the new preferred way to achieve this? Is there an up-to-date example or tutorial covering this API?

tolgap
DNA

8 Answers

53

The APIs for the Distributed Cache can be found in the Job class itself; check the documentation at http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html. The code should be something like:

Job job = new Job(); // note: this constructor is itself deprecated; Job.getInstance(conf) is the current form
...
job.addCacheFile(new Path(filename).toUri());

In your mapper code:

Path[] localPaths = context.getLocalCacheFiles(); // note: later deprecated in favour of context.getCacheFiles(), which returns URIs
...
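
Since getLocalCacheFiles() was itself deprecated in later releases (see the comments below), a minimal sketch of a mapper that sticks to the non-deprecated getCacheFiles() call could look like this; the symlink name myconfig.txt is only a placeholder, assuming the driver added the file with a #myconfig.txt fragment:

import java.io.File;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles(); // URIs, not local Paths
        if (cacheFiles != null && cacheFiles.length > 0) {
            // The localized copy sits in the task's working directory under its
            // symlink name (the part after '#'), or its original file name otherwise.
            File configFile = new File("myconfig.txt"); // placeholder symlink name
            // ... read configFile here ...
        }
        super.setup(context);
    }
}
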
user2371156
  • Thanks - and I assume that I therefore need to use the newer `mapreduce` API rather than `mapred`, otherwise the `JobContext` object is not provided to the mapper... – DNA Jan 20 '14 at 20:19
  • 11
    I think `getLocalCacheFiles()` is deprecated, but `getCacheFiles()` is OK - returns URIs not Paths though. – DNA Jan 21 '14 at 13:14
  • Nice! This is a much cleaner and simpler API than using DistributedCache. – Nishant Kelkar Oct 19 '14 at 11:36
  • 2
    @DNA I don't think `getLocalCacheFiles()` and `getCacheFiles()` are the same. You can check my question (http://stackoverflow.com/questions/26492964/are-getcachefiles-and-getlocalcachefiles-the-same). If you want to access localized files but don't want to use the deprecated API, you can use the file name to open it directly (the mechanism behind this is a symbolic link). – zli89 Oct 22 '14 at 19:19
  • but what if we use some framework (like cascading) that creates the jobs? We can only pass the jobconf to cascading framework - whats the alternative to DistributedCache in this case? – user2023507 Dec 26 '14 at 21:24
  • `context.getLocalCacheFiles()` is deprecated in Hadoop 2.6.4 – Somum Mar 12 '16 at 06:47
  • I am using `JobConf` in my driver. How do I move to `Job`, as in, what changes would I have to make? – Flame of udun Sep 25 '16 at 22:56
25

To expand on @jtravaglini, the preferred way of using DistributedCache for YARN/MapReduce 2 is as follows:

In your driver, use Job.addCacheFile():

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = Job.getInstance(conf, "MyJob");

    job.setMapperClass(MyMapper.class);

    // ...

    // Mind the # sign after the absolute file location.
    // You will be using the name after the # sign as your
    // file name in your Mapper/Reducer
    job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
    job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));

    return job.waitForCompletion(true) ? 0 : 1;
}

And in your Mapper/Reducer, override the setup(Context context) method:

@Override
protected void setup(
        Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {

        File some_file = new File("./some");
        File other_file = new File("./other");

        // Do things to these two files, like read them
        // or parse as JSON or whatever.
    }
    super.setup(context);
}
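
One possible way to consume those symlinked files in setup() is to simply read them from the task's working directory; here is a sketch under the assumption that the files are plain text (the JSON parsing itself is left out):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// ...inside the setup() shown above, in place of the "Do things" comment:
try (BufferedReader reader = new BufferedReader(new FileReader("./some"))) { // "some" is the symlink name from addCacheFile()
    String line;
    while ((line = reader.readLine()) != null) {
        // parse each line (e.g. as JSON) and keep the result in a mapper field
    }
}
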
tolgap
5

The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class.

   Job.addCacheFile()

Unfortunately, there aren't yet many comprehensive tutorial-style examples of this.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29

jtravaglini
  • I have no idea how to retrieve these cache files added using `Job.addCacheFile(URI)`. It does not work for me using the old way (`context.getCacheFiles()`), because the files are null. – tolgap Oct 17 '14 at 07:17
2

I did not use job.addCacheFile(). Instead I used the -files option, e.g. "-files /path/to/myfile.txt#myfile", as before. Then in the mapper or reducer code I use the helper method below:

/**
 * This method can be used with local execution or HDFS execution. 
 * 
 * @param context
 * @param symLink
 * @param throwExceptionIfNotFound
 * @return
 * @throws IOException
 */
public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
{
    URI[] uris = context.getCacheFiles();
    if(uris==null||uris.length==0)
    {
        if(throwExceptionIfNotFound)
            throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
        return null;
    }
    URI symlinkUri = null;
    for(URI uri: uris)
    {
        if(symLink.equals(uri.getFragment()))
        {
            symlinkUri = uri;
            break;
        }
    }   
    if(symlinkUri==null)
    {
        if(throwExceptionIfNotFound)
            throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
        return null;
    }
    //if we run this locally the file system URI scheme will be "file" otherwise it should be a symlink
    return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())?(new File(symlinkUri.getPath())):new File(symLink);

}

Then in mapper/reducer:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    File file = HadoopUtils.findDistributedFileBySymlink(context,"myfile",true);
    ... do work ...
}

Note that if I used "-files /path/to/myfile.txt" directly then I need to use "myfile.txt" to access the file since that is the default symlink name.
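
For context, the -files option is handled by GenericOptionsParser, so it is only picked up when the driver goes through ToolRunner; a hedged sketch of such a driver (class names are placeholders):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // Launched e.g. as: hadoop jar myjob.jar MyDriver -files /path/to/myfile.txt#myfile <job args>
        System.exit(ToolRunner.run(new MyDriver(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        // Build and submit the Job here; the -files entries are already registered in the
        // cache, so the mapper/reducer can look them up via findDistributedFileBySymlink().
        return 0;
    }
}
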

1

I had the same problem. Not only is DistributedCache deprecated, but so are getLocalCacheFiles() and "new Job()". So what worked for me is the following:

Driver:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
...
job.addCacheFile(new Path(filename).toUri());

In Mapper/Reducer setup:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    URI[] files = context.getCacheFiles();

    Path file1path = new Path(files[0]);
    ...
}
patapouf_ai
1

None of the solutions mentioned worked for me completely, possibly because the Hadoop version keeps changing; I am using Hadoop 2.6.4. Essentially, DistributedCache is deprecated, so I didn't want to use it. As some of the posts suggest, we can use addCacheFile(); however, it has changed a bit. Here is how it worked for me.

job.addCacheFile(new URI("hdfs://X.X.X.X:9000/EnglishStop.txt#EnglishStop.txt"));

Here X.X.X.X can be the master IP address or localhost. EnglishStop.txt was stored in HDFS at the root (/) location.

hadoop fs -ls /

The output is

-rw-r--r--   3 centos supergroup       1833 2016-03-12 20:24 /EnglishStop.txt
drwxr-xr-x   - centos supergroup          0 2016-03-12 19:46 /test

Funny but convenient: the #EnglishStop.txt fragment means we can now access the file as "EnglishStop.txt" in the mapper. Here is the code for that:

public void setup(Context context) throws IOException, InterruptedException
{
    File stopwordFile = new File("EnglishStop.txt"); // symlink name from addCacheFile()
    FileInputStream fis = new FileInputStream(stopwordFile);
    BufferedReader reader = new BufferedReader(new InputStreamReader(fis));

    String stopWord;
    while ((stopWord = reader.readLine()) != null) {
        // stopWord is a word read from the cached file
    }
    reader.close();
}

This just worked for me. You can read the file stored in HDFS line by line.
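
For reference, a hedged sketch of the matching driver-side call; note that the java.net.URI constructor throws a checked URISyntaxException, which main() has to declare or catch (the class name is a placeholder; the host/port and file are the ones from above):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class StopwordDriver {
    // 'throws Exception' covers both URISyntaxException and the job-related exceptions
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "stopword job");
        // The '#EnglishStop.txt' fragment creates a symlink of that name in each task's working directory
        job.addCacheFile(new URI("hdfs://X.X.X.X:9000/EnglishStop.txt#EnglishStop.txt"));
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true) ...
    }
}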

Somum
0

I just wanted to add something else to patapouf_ai's answer. If you need to read the content of the file in setup() after saving the file in the cache, you have to do something like this:

In Mapper/Reducer setup:

protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // Get a FileSystem object to read the file
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);

    URI[] files = context.getCacheFiles();
    Path patternsFile = new Path(files[0]);
    parseSkipFile(patternsFile, fs);
}

// patternsToSkip is assumed to be a Set<String> field on the mapper
private void parseSkipFile(Path patternsFile, FileSystem fs) {
    try {
        BufferedReader fis = new BufferedReader(new InputStreamReader(fs.open(patternsFile)));
        String pattern = null;
        while ((pattern = fis.readLine()) != null) {
            // Here you can do whatever you want by reading the file line by line
            patternsToSkip.add(pattern);
        }
        fis.close();
    } catch (IOException ioe) {
        System.err.println("Caught exception while parsing the cached file '"
                + patternsFile + "' : " + StringUtils.stringifyException(ioe));
    }
}
kevininhe
0

This has been tested with Hadoop 3.3.2.

Add this in the driver class (we mainly need addCacheFile):

package src;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class Driver {
    public static void main(String[] args) throws Exception{

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed cache");
        job.addCacheFile(new Path("/fileToBeCachedFromHDFS.txt").toUri());
        System.exit(job.waitForCompletion(true)?0:1);
    }
}
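
As written, the driver only demonstrates the cache call; for the job to actually run, it would also need a mapper class, output types, and input/output paths set before waitForCompletion. A hedged completion sketch (class names and paths are placeholders):

// Hypothetical additions inside main(), before waitForCompletion(); names and paths are placeholders.
// Requires imports for Text, DoubleWritable, FileInputFormat and FileOutputFormat.
job.setJarByClass(Driver.class);
job.setMapperClass(MyMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));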

Then, access the cached file inside the setup() function (using FileSystem and a BufferedReader).

import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.fs.FileSystem;

import java.io.BufferedReader;
import java.io.FileNotFoundException;

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

public class MyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    public void setup(Context context) throws IOException, InterruptedException {

        URI[] cacheFile = context.getCacheFiles(); 

        if (cacheFile != null && cacheFile.length > 0) {
            try {
                String line = "";
                FileSystem fs = FileSystem.get(context.getConfiguration());
                Path getFilePath = new Path(cacheFile[0].toString());

                BufferedReader myReader = new BufferedReader(
                        new InputStreamReader(fs.open(getFilePath)));

                while ((line = myReader.readLine()) != null) {
                    String[] words = line.split(",");
                    // use 'words' here, e.g. load them into a field on the mapper
                }

                myReader.close();
            }
            catch (FileNotFoundException e) {
                System.out.println("An error occurred while opening the cached file");
            }
        }
    }
}

Lawhatre