How to read file names and word count in respective files in Hadoop?

Question

I am trying to fetch file names from a sequence file on Hadoop using the dumbo Python package, but the keys I get back are some kind of identifier rather than file names. How can I map these identifiers to file names?

Below are the steps I run on the Hadoop system to get the file names:

Step 1) Generating the sequence file

Command :

hadoop jar /mnt/Clustering/Checking/AllJars/binarypig-1.0-SNAPSHOT-jar-with-dependencies.jar com.endgame.binarypig.util.BuildSequenceFileFromDir /mnt/Clustering/Checking/text_files text_files_seq

Step 2) Running python script on sequence file through hadoop

Command:

dumbo start dumbo_map_red.py -input text_files_seq -output out_res -hadoop /usr/local/hadoop

Step 3) Getting output in local directory

Command:

dumbo cat out_res/part-* -hadoop /usr/local/hadoop > out_res.txt

where dumbo_map_red.py is

#!/usr/bin/env python

def mapper(key, value):
    yield key, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

Please help me work out how to fetch the file names. If there is another Python package that would let me work this way, please let me know.

score 0 · Answer 1 · answered Jan 05 '15 at 13:59

I finally found out how the identifier in the sequence file maps to the actual file: the identifier is the MD5 hash of the file in the directory.
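One way to turn this into a lookup is to hash the files in the original local directory and key them by digest. The following is only a minimal sketch: it assumes the identifier is the hex MD5 digest of each file's contents (not of its path), and the helper names `md5_of_file`, `build_md5_index` and `resolve_keys` are made up for illustration, not part of dumbo or binarypig.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_md5_index(directory):
    """Map hex MD5 digest -> file name for each regular file in directory.

    Assumption: the sequence-file key is the MD5 of the file contents.
    Files with identical contents share a digest, so the last one wins.
    """
    index = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            index[md5_of_file(path)] = name
    return index


def resolve_keys(lines, index):
    """Swap the leading key of each tab-separated line for a file name.

    Keys that are not found in the index are left unchanged.
    """
    for line in lines:
        key, sep, rest = line.rstrip("\n").partition("\t")
        yield index.get(key.strip(), key) + sep + rest
```

For the setup in the question, the index would be built from /mnt/Clustering/Checking/text_files and applied to the lines of out_res.txt. This assumes each output line is a key, a tab, and a count; if the keys come out quoted or otherwise decorated, they would need to be cleaned up before the lookup.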

Sanjay Bhosale

