On an Ubuntu virtual machine I have set up a single-node cluster as per Michael Noll's tutorial, and this has been my starting point for writing a Hadoop program.
Also, for reference, this.
My program is in Python and uses Hadoop Streaming.
I have written a simple vector multiplication program where mapper.py takes input files v1 and v2, each containing a vector in the form 12,33,10, and returns the element-wise products. Then reducer.py returns the sum of the products, i.e.:
mapper: map(mult,v1,v2)
reducer: sum(p1,p2,p3,...,pn)
mapper.py:

import sys

def mult(x, y):
    return int(x) * int(y)

# Input comes from STDIN (standard input).
inputvec = tuple()
for i in sys.stdin:
    i = i.strip()
    inputvec += (tuple(i.split(",")),)

v1 = inputvec[0]
v2 = inputvec[1]

results = map(mult, v1, v2)

# Simply printing the results variable would print the tuple. This
# would be fine except that the STDIN of reducer.py takes all the
# output as input, including brackets, which can be problematic.
# Cleaning the output ready to be input for the Reduce step:
for o in results:
    print ' %s' % o,
reducer.py:

import sys

result = int()
for a in sys.stdin:
    a = a.strip()
    a = a.split()
    for r in range(len(a)):
        result += int(a[r])

print result
In the in subdirectory I have v1 containing 5,12,20 and v2 containing 14,11,3.
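For these inputs the expected result is just the dot product: 5*14 + 12*11 + 20*3 = 70 + 132 + 60 = 262.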
Testing locally, things work as expected:
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py
70 132 60
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort
70 132 60
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort | python ./reducer.py
262
When I run it in Hadoop, it appears to run successfully and doesn't throw any exceptions:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper python /home/hduser/VectMult3/mapper.py -reducer python /home/hduser/VectMult3/reducer.py -input /home/hduser/VectMult3/in -output /home/hduser/VectMult3/out4
Warning: $HADOOP_HOME is deprecated.
packageJobJar: [/app/hadoop/tmp/hadoop-unjar2168776605822419867/] [] /tmp/streamjob6920304075078514767.jar tmpDir=null
12/11/18 21:20:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/18 21:20:09 WARN snappy.LoadSnappy: Snappy native library not loaded
12/11/18 21:20:09 INFO mapred.FileInputFormat: Total input paths to process : 2
12/11/18 21:20:09 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
12/11/18 21:20:09 INFO streaming.StreamJob: Running job: job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: To kill this job, run:
12/11/18 21:20:09 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201211181903_0009
12/11/18 21:20:10 INFO streaming.StreamJob: map 0% reduce 0%
12/11/18 21:20:24 INFO streaming.StreamJob: map 67% reduce 0%
12/11/18 21:20:33 INFO streaming.StreamJob: map 100% reduce 0%
12/11/18 21:20:36 INFO streaming.StreamJob: map 100% reduce 22%
12/11/18 21:20:45 INFO streaming.StreamJob: map 100% reduce 100%
12/11/18 21:20:51 INFO streaming.StreamJob: Job complete: job_201211181903_0009
12/11/18 21:20:51 INFO streaming.StreamJob: Output: /home/hduser/VectMult3/out4
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /home/hduser/VectMult3/out4/part-00000
Warning: $HADOOP_HOME is deprecated.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /home/hduser/VectMult3/out4/
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 hduser supergroup 0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_SUCCESS
drwxr-xr-x - hduser supergroup 0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_logs
-rw-r--r-- 1 hduser supergroup 0 2012-11-18 22:05 /home/hduser/VectMult3/out4/part-00000
But when I check the output, all I find is an empty, 0-byte file.
I can't work out what's gone wrong. Can anyone help?
Edit: Response to @DiJuMx
> One way to fix this would be to output to a temporary file from map, and then use the temporary file in reduce.
I'm not sure Hadoop allows this; hopefully someone who knows better can correct me on this.
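If I understand the suggestion correctly, it would mean something like the sketch below (this is not the code I'm actually running, and the path /tmp/products.txt is just a placeholder I've made up). It is also why I'm doubtful: on a multi-node cluster the map and reduce tasks may run on different machines, so they wouldn't necessarily share a local filesystem.

# mapper.py (sketch of the temporary-file idea only):
# write the products to a local file instead of stdout.
import sys

def mult(x, y):
    return int(x) * int(y)

inputvec = tuple()
for i in sys.stdin:
    inputvec += (tuple(i.strip().split(",")),)

# /tmp/products.txt is a made-up path, purely for illustration
out = open('/tmp/products.txt', 'w')
for o in map(mult, inputvec[0], inputvec[1]):
    out.write('%s ' % o)
out.close()

# reducer.py (sketch only): read the products back from the same file,
# ignoring whatever arrives on stdin.
result = 0
for token in open('/tmp/products.txt').read().split():
    result += int(token)
print result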
> Before attempting this, try writing a simpler version which just passes the data straight through with no processing.
I thought this was a good idea, just to check that the data is flowing through correctly. I used the following for this:
Both mapper.py and reducer.py:

import sys

for i in sys.stdin:
    print i,
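This can be checked locally in the same way as before, e.g.:

cat in/* | python ./mapper.py | python ./reducer.py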
What comes out should be exactly what went in. In Hadoop, though, it still outputs an empty file.
> Alternatively, edit your existing code in reduce to output an (error) message to the output file if the input was blank.
mapper.py:

import sys

for i in sys.stdin:
    print "mapped",
print "mapper",
reducer.py:

import sys

for i in sys.stdin:
    print "reduced",
print "reducer",
If any input is received, it should ultimately output reduced. Either way, it should at least output reducer. The actual output is still an empty file.