2

I am writing a Hadoop app using mrjob. I need to use the distributed cache to access some files. I know that Hadoop Streaming has a -files option, but I don't know how to access those files from within the program.

Thanks for your help.

2 Answers

2

I think you have to use

mrjob.compat.supports_new_distributed_cache_options(version)

to check whether your Hadoop version supports the newer options, and then use -files and -archives instead of -cacheFile and -cacheArchive.

Maybe you will find more information here

Manish Verma
-1

You can read the files in your program as though they were local, i.e. as if each file were in the same directory as the running code.

I am not good at Python, so here is an example in Ruby, mapper.rb:

# The cached file is read like any file in the working directory
File.foreach("my-distributed-cache-file.txt") do |line|
    # do something with each line of your file
end
# Rest of mapper code
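
Since the question is about Python, here is a rough Python counterpart of the Ruby snippet above. It is only a sketch: it assumes the cached file has already been shipped to the task's working directory (for example via -files), and the file name, `load_cache`, and `map_lines` are hypothetical names chosen for illustration, not part of mrjob or Hadoop.

```python
# Hypothetical name: use whatever name the file was shipped under.
CACHE_FILE = "my-distributed-cache-file.txt"


def load_cache(path=CACHE_FILE):
    """Read the cached file as a plain local file and return its
    non-empty lines as a set."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def map_lines(lines, cache):
    """Yield (word, 1) for every input word that also appears in the cache."""
    for line in lines:
        for word in line.split():
            if word in cache:
                yield word, 1
```

In a streaming mapper you would probably call `map_lines(sys.stdin, load_cache())` and print each pair tab-separated. The same relative-path `open()` should work inside an mrjob mapper method too, provided the file really is present in the task's working directory.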
Amar
  • Thank you. The problem is that I am trying to use --hadoop-arg in mrjob to pass the cache file, i.e., --hadoop-arg -files hdfs://localhost:54310/cache.txt, but it's not working. – user2257622 Apr 08 '13 at 14:37
  • This doesn't address the question, which specifically asks for mrjob, the Python package. – Taro Sato Sep 27 '13 at 22:06