I am writing a Hadoop app using mrjob. I need to use the distributed cache to access some files. I know that Hadoop Streaming has a -files option, but I don't know how to access those files from within the program.
Thanks for your help.
I think you have to use
mrjob.compat.supports_new_distributed_cache_options(version)
to check your Hadoop version, and then use -files and -archives instead of -cacheFile and -cacheArchive.
Maybe you will find more here
You can read the files in your program as though they were local, i.e. as if each file sat in the same directory as the running code.
I am not good at Python, so here is an example in Ruby, mapper.rb:
begin
  file = File.open("my-distributed-cache-file.txt")
  while (line = file.gets)
    # do something with your file
  end
  file.close
end
# Rest of mapper code
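For reference, roughly the same idea in Python might look like the sketch below. It is plain Python rather than mrjob-specific code, and it assumes the file was shipped with -files so that it appears in the task's working directory. The file name is carried over from the Ruby example, and load_cache_lines and run_mapper are made-up helper names, not part of mrjob or Hadoop.

```python
import sys

# Hypothetical file name, same as in the Ruby example above. It is assumed
# to have been shipped with the job (e.g. via -files), so it is opened by
# its bare name relative to the task's working directory.
CACHE_FILE = "my-distributed-cache-file.txt"


def load_cache_lines(path=CACHE_FILE):
    """Read the side file as if it were an ordinary local file."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]


def run_mapper(stdin=sys.stdin, stdout=sys.stdout, path=CACHE_FILE):
    """Illustrative mapper: emit input lines that also appear in the side file."""
    wanted = set(load_cache_lines(path))
    for line in stdin:
        key = line.rstrip("\n")
        if key in wanted:
            # Streaming-style output: key, tab, value.
            stdout.write(key + "\t1\n")
```

In a streaming mapper script you would call run_mapper() at the bottom of the file. The filtering logic is only a placeholder for whatever you actually need to do with the file.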