I am trying to fetch file names from a Hadoop sequence file using the dumbo package for Python, but the key I get back is some kind of identifier rather than a file name. How can I map this identifier to the file name?
Below are the steps I run on the Hadoop system to get the file names:
Step 1) Generating the sequence file
Command :
hadoop jar /mnt/Clustering/Checking/AllJars/binarypig-1.0-SNAPSHOT-jar-with-dependencies.jar com.endgame.binarypig.util.BuildSequenceFileFromDir /mnt/Clustering/Checking/text_files text_files_seq
Step 2) Running the Python script on the sequence file through Hadoop
Command:
dumbo start dumbo_map_red.py -input text_files_seq -output out_res -hadoop /usr/local/hadoop
Step 3) Copying the output to a local file
Command:
dumbo cat out_res/part-* -hadoop /usr/local/hadoop > out_res.txt
where dumbo_map_red.py is:
#!/usr/bin/env python

def mapper(key, value):
    # Emit each record key from the sequence file with a count of 1.
    yield key, 1

def reducer(key, values):
    # Sum the counts for each key.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
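To make clear what the script computes, here is a minimal local sketch of the same mapper and reducer run without Hadoop or dumbo. The `run_local` helper and the sample keys are made up for illustration and are not part of dumbo; on the cluster, the keys are whatever dumbo reads from the sequence file.

```python
from collections import defaultdict


def mapper(key, value):
    # Same logic as in dumbo_map_red.py: one count per record key.
    yield key, 1


def reducer(key, values):
    # Same logic as in dumbo_map_red.py: total count per key.
    yield key, sum(values)


def run_local(records):
    """Hypothetical helper (not a dumbo API): map, group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results


if __name__ == "__main__":
    # Made-up (key, value) pairs standing in for sequence file records.
    sample = [("id-1", b"aaa"), ("id-2", b"bbb"), ("id-1", b"ccc")]
    for key, count in run_local(sample):
        print(key, count)
```

So the output only ever contains the record keys as they are stored in the sequence file, each with its count; the script itself never sees the original file names unless they are part of the key or value.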
Please help me understand how to fetch the file names. If there is another Python package that would let me work this way, please let me know.