Hadoop DistributedCache caching files without absolute path?

Question

I am in the process of migrating to YARN and it seems the behavior of the DistributedCache changed.

Previously, I would add some files to the cache as follows:

for (String file : args) {
   Path path = new Path(cache_root, file);
   URI uri = new URI(path.toUri().toString());
   DistributedCache.addCacheFile(uri, conf);
}

The path would typically look like

/some/path/to/my/file.txt

This file already exists on HDFS and would essentially end up in the DistributedCache as

/$DISTRO_CACHE/some/path/to/my/file.txt

I could symlink to it in my current working directory and use it with DistributedCache.getLocalCacheFiles().

With YARN, it seems this file instead ends up in the cache as:

/$DISTRO_CACHE/file.txt

i.e., the 'path' part of the file URI gets dropped and only the filename remains.

How does this work when different absolute paths end up with the same filename? Consider the following case:

DistributedCache.addCacheFile(URI.create("some/path/to/file.txt"), conf);
DistributedCache.addCacheFile(URI.create("some/other/path/to/file.txt"), conf);

Arguably someone could use fragments:

DistributedCache.addCacheFile(URI.create("some/path/to/file.txt#file1"), conf);
DistributedCache.addCacheFile(URI.create("some/other/path/to/file.txt#file2"), conf);

But this seems unnecessarily hard to manage. Imagine the scenario where those paths are command-line arguments: you somehow need to detect that the two filenames, although they come from different absolute paths, would clash in the DistributedCache, re-map them to fragments, and then propagate those fragments to the rest of the program.
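To make the re-mapping concrete, here is a minimal sketch of what it might look like, using only java.net.URI. The class CacheLinkNames and its methods are hypothetical names for illustration, not part of Hadoop, and the scheme of flattening the path into the fragment is just one assumed convention.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Hadoop class): attaches a unique "#fragment"
// to each cache path so that equal filenames no longer clash.
final class CacheLinkNames {

    // Flattens a path into a single name, e.g.
    // "/some/path/to/file.txt" becomes "some_path_to_file.txt".
    static String linkNameFor(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return trimmed.replace('/', '_');
    }

    // Builds one URI per path, each carrying a fragment that is unique
    // within the returned list.
    static List<URI> withUniqueFragments(List<String> paths) throws URISyntaxException {
        Set<String> used = new HashSet<>();
        List<URI> uris = new ArrayList<>();
        for (String path : paths) {
            String base = linkNameFor(path);
            String name = base;
            int suffix = 1;
            // Flattening can itself collide ("a/b_c" vs "a_b/c"), so add a counter if needed.
            while (!used.add(name)) {
                name = base + "_" + suffix++;
            }
            // scheme and host are left null; only path and fragment are set.
            uris.add(new URI(null, null, path, name));
        }
        return uris;
    }
}
```

Each resulting URI could then be passed to DistributedCache.addCacheFile(uri, conf), and the rest of the program would only need linkNameFor(path) to recover the name for a given argument.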

Is there an easier way to manage this?

Gaurav Mishra · Answer 1 · 2015-03-17T16:05:29.863

0

Try adding the files to the Job instead.

The problem is most likely in how you're configuring the job and then accessing the files in the Mapper.

When you're setting up the job you're going to do something like

    job.addCacheFile(new Path("cache/file1.txt").toUri());
    job.addCacheFile(new Path("cache/file2.txt").toUri());

Then in your mapper code, the URIs are stored in an array, which can be accessed like so:

    URI file1Uri = context.getCacheFiles()[0];
    URI file2Uri = context.getCacheFiles()[1];
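Relying on array positions can be brittle if the order in which files are added changes. As a rough sketch, the entries could instead be looked up by name. The helper below is hypothetical (not part of Hadoop) and assumes the name of interest is the URI fragment when one is present, and the last path segment otherwise.

```java
import java.net.URI;

// Hypothetical helper (not a Hadoop class): finds a cache entry by name
// rather than by its index in the array.
final class CacheFileLookup {

    // Assumed naming rule: the fragment if present, else the last path segment.
    static String linkName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Returns the first URI whose name matches, or null if none does.
    static URI find(URI[] cacheFiles, String name) {
        for (URI uri : cacheFiles) {
            if (linkName(uri).equals(name)) {
                return uri;
            }
        }
        return null;
    }
}
```

In the mapper this might be called as CacheFileLookup.find(context.getCacheFiles(), "file1.txt").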

Hope this helps.

edited Mar 17 '15 at 16:05

answered Mar 17 '15 at 15:55


That would work too, but getting the files isn't really the problem. Dropping the path is problematic for two reasons. First, if the file was passed as an argument, I can't simply propagate the argument, since the file won't be found under that path. Second, two files with the same filename but different paths now conflict, without any exception or error. This wasn't a concern before YARN, since the path was added as-is to the DistributedCache. – cobralucha Mar 18 '15 at 05:53
