I am running a Hadoop program and have the following as my input file, input.txt:

1
2

mapper.py:

import sys
for line in sys.stdin:
    print line,
print "Test"

reducer.py:

import sys
for line in sys.stdin:
    print line,

When I run it without Hadoop: $ cat ./input.txt | ./mapper.py | ./reducer.py, the output is as expected:

1
2
Test

However, running it through Hadoop via the streaming API (as described here), the latter part of the output seems somewhat "doubled":

1
2
Test    
Test

Additionally, when I run the program through Hadoop, it fails roughly one time in four with this error:

Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I've looked at this for some time and can't figure out what I'm not getting. If anyone could help with these issues, I would greatly appreciate it! Thanks.

edit: When input.txt is:

1
2
3
4
5
6
7
8
9
10

The output is:

1   
10  
2   
3   
4   
5   
6   
7   
8   
9   
Test    
Test
Objc55
  • I tried it and I am getting the same output even with hadoop streaming! – Amar Sep 25 '13 at 17:51
  • @Amar Very weird. It definitely doubles it for me. – Objc55 Sep 25 '13 at 18:02
  • Can you post the output Hadoop shows on the screen after you submit your job using the command you indicated (`bin/hadoop jar contrib/streaming/hadoop-streaming.jar \ -file /hadoop/mapper.py -mapper mapper.py -file /hadoop/reducer.py -reducer reducer.py -input /hadoop/input.txt -output /hadoop/output`)? – cabad Oct 02 '13 at 22:22
  • @cabad This is stated in the post -- the output is `1,2,Test,Test` with the commas representing line breaks. – Objc55 Oct 04 '13 at 16:47
  • I understand this. I mean the rest of the output; this provides information about how Hadoop is running your code. – cabad Oct 04 '13 at 18:10
  • @cabad I'm sorry -- I'm not quite sure what you mean, as that is the entire output. I have tried the same code with the input being the numbers 1-10, each on a new line, and will update this in the post. – Objc55 Oct 04 '13 at 18:21
  • @cabad If there is a specific input file you'd like me to test, please let me know. Thanks – Objc55 Oct 04 '13 at 18:31
  • Ok, I guess my terminology was confusing. I don't mean the output in "output.txt". I mean the output on the screen (i.e., on `stdout`). – cabad Oct 04 '13 at 18:31
  • @cabad Oh, gotcha. Other than the map and reduce percentages and a tracking URL, there is no information provided. – Objc55 Oct 04 '13 at 18:36
  • @cabad I'm wondering if it might be related to this issue: http://stackoverflow.com/questions/19188263/hadoop-and-python-disable-sorting – Objc55 Oct 04 '13 at 18:36
  • Sigh... I am not sure if you really want help or not. "Other than the map and reduce percentages and a tracking URL, there is no information provided." Are you sure? Don't you have a line that says "Total input paths to process : x"? – cabad Oct 04 '13 at 20:48
  • @cabad Yes, sorry. I assumed this was reflecting what's written in the command-line entry. It says `Total input paths to process : 1` – Objc55 Oct 04 '13 at 20:59
  • Ok, then I have no idea why you get that output. I can only think of three alternatives: (1) your input.txt file in HDFS has a "Test" string in the last line, (2) you are mistakenly using your mapper as reducer, or (3) you have, by mistake, added a print "Test" statement at the end of your reducer. – cabad Oct 04 '13 at 21:10
  • @cabad Thanks for trying, it's certainly weird. I have double-checked that none of those 3 possibilities are occurring. – Objc55 Oct 04 '13 at 21:22

1 Answer

It gives the same output for me. My guess is that the reducer option is pointing at mapper.py as well, so the mapper logic runs twice. Make sure you are providing the correct path to reducer.py.
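A quick way to see what this guess would look like is to model the two scripts as plain functions and compare the intended pipeline with one where mapper.py is used for both stages. This is only a sketch: the function names `mapper` and `reducer` are stand-ins for the two scripts, and Hadoop's shuffle/sort step between them is left out (for this input it would not change the order anyway).

```python
def mapper(lines):
    """Stand-in for mapper.py: echo every input line, then emit "Test" once."""
    out = list(lines)
    out.append("Test")
    return out


def reducer(lines):
    """Stand-in for reducer.py: echo every input line unchanged."""
    return list(lines)


input_lines = ["1", "2"]

# Intended pipeline: mapper output is fed to reducer.py.
intended = reducer(mapper(input_lines))

# Hypothetical misconfiguration: mapper.py is also used as the reducer.
misconfigured = mapper(mapper(input_lines))

print(intended)       # ['1', '2', 'Test']
print(misconfigured)  # ['1', '2', 'Test', 'Test']
```

If the doubled "Test" has this cause, then making reducer.py print something distinctive should make that marker go missing from the Hadoop output.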

user1525721
  • I have tried this and unfortunately run into the same issue. This is the console command: `bin/hadoop jar contrib/streaming/hadoop-streaming.jar \ -file /hadoop/mapper.py -mapper mapper.py -file /hadoop/reducer.py -reducer reducer.py -input /hadoop/input.txt -output /hadoop/output` – Objc55 Sep 30 '13 at 19:45
  • Any ideas? No idea why this is happening :( – Objc55 Sep 30 '13 at 21:55
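One more possibility, offered purely as an assumption since nothing in the thread confirms it: if Hadoop runs more than one map task over input.txt, each mapper.py process reaches its final `print "Test"` once, so the merged output contains one "Test" per map task. The sketch below models that with plain functions; `mapper` and `run_job` are hypothetical names, and the way the input is divided into splits is made up for illustration.

```python
def mapper(lines):
    """Stand-in for mapper.py: echo every input line, then emit "Test" once."""
    out = list(lines)
    out.append("Test")
    return out


def run_job(splits):
    """Run one mapper per split, merge and sort the output, then apply an
    identity reducer (which is what reducer.py does)."""
    mapped = []
    for split in splits:
        mapped.extend(mapper(split))
    return sorted(mapped)


# A single map task sees the whole file.
one_task = run_job([["1", "2"]])

# Assumed scenario: the file is divided between two map tasks.
two_tasks = run_job([["1"], ["2"]])

print(one_task)   # ['1', '2', 'Test']
print(two_tasks)  # ['1', '2', 'Test', 'Test']
```

The number of map tasks that were actually launched should be visible on the job's tracking URL, which would confirm or rule this out.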