Given a Hadoop cluster, I have a job with a large set of files that every worker needs to access during the reduce stage.
It seems it would be a good idea to use the facilities of DistributedCache. However, it appears it does not provide the following desired behaviors:
Lazy file fetching: files should be copied to a worker lazily, i.e. cached locally only when the worker actually tries to read them.
getLocalCacheFiles is awkward: a related problem is the DistributedCache interface itself. To access the local files, it seems one has to call DistributedCache.getLocalCacheFiles(conf), which returns all cached files at once. Is there a way to request a single file by name (e.g. DistributedCache.getLocalFile(conf, fileName))?
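As far as I can tell, a by-name lookup has to be layered on top of getLocalCacheFiles manually. A minimal sketch of such a workaround (the helper name findByName and the file names are my own invention, not part of the Hadoop API):

```java
import java.io.File;

public class CacheLookup {

    // Scan the local cache paths (as strings) for the entry whose base
    // file name matches `fileName`; return null if it is not in the cache.
    public static String findByName(String[] localPaths, String fileName) {
        for (String p : localPaths) {
            if (new File(p).getName().equals(fileName)) {
                return p;
            }
        }
        return null;
    }

    // In an actual job you would feed it the paths reported by the cache
    // (sketch only, assumes the classic DistributedCache API):
    //   Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    //   String[] asStrings = new String[cached.length];
    //   for (int i = 0; i < cached.length; i++) asStrings[i] = cached[i].toString();
    //   String localCopy = findByName(asStrings, "lookup-table.dat");
}
```

Note this still forces every file to be distributed and localized up front; it only fixes the lookup ergonomics, not the lazy-fetching requirement.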
Can DistributedCache do this? Is there any other library that satisfies my requirements?
Thank you!