I am in the process of migrating to YARN and it seems the behavior of the DistributedCache changed.
Previously, I would add some files to the cache as follows:
    for (String file : args) {
        Path path = new Path(cache_root, file);
        URI uri = new URI(path.toUri().toString());
        DistributedCache.addCacheFile(uri, conf);
    }
The path would typically look like
/some/path/to/my/file.txt
This file already exists on HDFS and would essentially end up in the DistributedCache as:
/$DISTRO_CACHE/some/path/to/my/file.txt
I could then symlink to it from my current working directory and use it with DistributedCache.getLocalCacheFiles().
With YARN, it seems this file instead ends up in the cache as:
/$DISTRO_CACHE/file.txt
i.e., the directory part of the file URI is dropped and only the filename remains.
How does this work when different absolute paths end in the same filename? Consider the following case:
DistributedCache.addCacheFile(new URI("some/path/to/file.txt"), conf);
DistributedCache.addCacheFile(new URI("some/other/path/to/file.txt"), conf);
Arguably someone could use fragments:
DistributedCache.addCacheFile(new URI("some/path/to/file.txt#file1"), conf);
DistributedCache.addCacheFile(new URI("some/other/path/to/file.txt#file2"), conf);
But this seems unnecessarily hard to manage. Imagine that those paths are command-line arguments: I would have to detect that two different absolute paths share a filename and would therefore clash in the DistributedCache, re-map those filenames to fragments, and then propagate the new names to the rest of the program.
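To make the concern concrete, here is a rough sketch of the kind of re-mapping I would apparently have to write myself. It is only an illustration: the class name CacheFragments, the method withUniqueFragments and the "2_" prefix scheme are all made up for this example, and it assumes the fragment is what ends up as the link name in the working directory.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: gives every path a unique "#fragment".
final class CacheFragments {

    static List<URI> withUniqueFragments(List<String> paths) {
        Set<String> used = new HashSet<>();   // link names handed out so far
        List<URI> uris = new ArrayList<>();
        for (String p : paths) {
            // File name is whatever follows the last '/'.
            String name = p.substring(p.lastIndexOf('/') + 1);
            String link = name;
            // On a clash, try "2_name", "3_name", ... until the name is unused.
            for (int i = 2; !used.add(link); i++) {
                link = i + "_" + name;
            }
            try {
                // URI(scheme, host, path, fragment): no scheme or host, so the
                // result is just "path#link".
                uris.add(new URI(null, null, p, link));
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException("Bad cache path: " + p, e);
            }
        }
        return uris;
    }
}
```

Each resulting URI would then be passed to DistributedCache.addCacheFile(uri, conf) as in my loop above, and the rest of the program would have to look files up by uri.getFragment() rather than by the original filename, which is exactly the bookkeeping I would like to avoid.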
Is there an easier way to manage this?