
I'm new to Hadoop and MapReduce. I'm trying to write a MapReduce job that outputs the top 10 words by count from a word-count txt file.

My txt file 'q2_result.txt' looks like:

yourself        268
yourselves      73
yoursnot        1
youst   1
youth   270
youthat 1
youthful        31
youths  9
youtli  1
youwell 1
youwondrous     1
youyou  1
zanies  1
zany    1
zeal    32
zealous 6
zeals   1

Mapper:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    word, count = line.split()
    print "%s\t%s" % (word, count)

Reducer:

#!usr/bin/env/ python

import sys

top_n = 0
for line in sys.stdin:
    line = line.strip()
    word, count = line.split()

    top_n += 1
    if top_n == 11:
        break
    print '%s\t%s' % (word, count)
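Before running on the cluster, the mapper/reducer pair can be sanity-checked locally by simulating Hadoop streaming's `cat file | mapper | sort | reducer` pipeline. A minimal sketch (Python 3 syntax here rather than the Python 2 of the scripts above; the sample lines are a small excerpt for illustration):

```python
# Locally emulate: cat q2_result.txt | mapper.py | sort | reducer.py
import io

sample = """yourself\t268
youth\t270
zeal\t32
zealous\t6
"""

def mapper(lines):
    # Identity mapper: re-emit "word<TAB>count", same as mapper.py above.
    for line in lines:
        word, count = line.strip().split()
        yield "%s\t%s" % (word, count)

def reducer(lines, n=10):
    # Pass through the first n lines only, same as reducer.py above.
    for i, line in enumerate(lines):
        if i == n:
            break
        word, count = line.strip().split()
        yield "%s\t%s" % (word, count)

mapped = list(mapper(io.StringIO(sample)))
shuffled = sorted(mapped)  # Hadoop's sort phase: lexicographic on the key
top = list(reducer(shuffled, n=2))
print("\n".join(top))
```

Note that with this reducer the "top 10" only comes out right if the framework has already sorted by count, which the default lexicographic sort above does not do.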

I know you can pass a flag to the -D option in the hadoop jar command so it sorts on the key you want (in my case the count, which is k2,2), but here I'm just using a simple command first:

hadoop jar /usr/hdp/2.5.0.0-1245/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.0.0-1245.jar -file /root/LAB3/mapper.py -mapper mapper.py -file /root/LAB3/reducer.py -reducer reducer.py -input /user/root/lab3/q2_result.txt -output /user/root/lab3/test_out
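For the sorted variant mentioned above, something along these lines should make streaming sort on the numeric count in descending order, using Hadoop's standard KeyFieldBasedComparator (a sketch only, not verified on this sandbox; jar and HDFS paths are copied from the command above):

```shell
hadoop jar /usr/hdp/2.5.0.0-1245/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.0.0-1245.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -D stream.num.map.output.key.fields=2 \
    -file /root/LAB3/mapper.py -mapper mapper.py \
    -file /root/LAB3/reducer.py -reducer reducer.py \
    -input /user/root/lab3/q2_result.txt -output /user/root/lab3/test_out_sorted
```

Here `-k2,2nr` sorts on the second tab-separated field numerically in reverse, and `stream.num.map.output.key.fields=2` makes both fields part of the key so the comparator can see the count.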

I thought such a simple mapper and reducer shouldn't give me errors, but they did, and I can't figure out why. Errors here: http://pastebin.com/PvY4d89c

(I'm using the Hortonworks HDP Sandbox in VirtualBox on Ubuntu 16.04.)

Sam
  • Please check this out http://stackoverflow.com/questions/4339788/hadoop-streaming-unable-to-find-file-error – Rahmath Sep 30 '16 at 19:04

1 Answer


I know a "file not found" error sounds like something completely different from "file cannot be executed", but in this case the problem is that the file cannot be executed.

In reducer.py:

Wrong:

#!usr/bin/env/ python

Correct:

#!/usr/bin/env python
ozw1z5rd
  • I cannot believe I missed it... Could you explain why this difference causes an error in Hadoop streaming? I sort of understand that including #! tells Hadoop you are executing Python files. – Sam Oct 03 '16 at 01:22
  • `env` is a program located in /usr/bin. By writing `usr/bin/env/` you are actually trying to run a directory. This program allows you to use python without an absolute path. With #! you tell the system which program executes the script, and it must exist and be runnable. – ozw1z5rd Oct 03 '16 at 06:02
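The failure mode can be reproduced locally: executing a script whose shebang points at a nonexistent interpreter fails at exec time with ENOENT, which is what streaming surfaces as a file error. A small Linux-only sketch (the broken shebang is the one from the question; running from a scratch directory ensures the relative path cannot accidentally resolve):

```python
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Write a script whose shebang is the broken one from the question.
    path = os.path.join(d, "reducer.py")
    with open(path, "w") as f:
        f.write("#!usr/bin/env/ python\nprint('never runs')\n")
    os.chmod(path, 0o755)
    try:
        # The kernel resolves the relative interpreter path 'usr/bin/env/'
        # against cwd, finds nothing, and exec fails with ENOENT.
        subprocess.run([path], cwd=d)
        outcome = "ran"
    except OSError as e:
        outcome = "exec failed: %s" % e.__class__.__name__
    print(outcome)
```

With the corrected `#!/usr/bin/env python` shebang, the same script executes normally.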