On an Ubuntu virtual machine I have set up a single-node cluster as per Michael Noll's tutorial, and this has been my starting point for writing a Hadoop program.
Also, for reference, this.
My program is in Python and uses Hadoop Streaming.
I have written a simple vector multiplication program where mapper.py takes input files v1 and v2, each containing a vector in the form 12,33,10, and returns the element-wise products. Then reducer.py returns the sum of the products, i.e.:
mapper: map(mult,v1,v2)
reducer: sum(p1,p2,p3,...,pn)
mapper.py:

import sys

def mult(x, y):
    return int(x) * int(y)

# Input comes from STDIN (standard input).
inputvec = tuple()
for i in sys.stdin:
    i = i.strip()
    inputvec += (tuple(i.split(",")),)

v1 = inputvec[0]
v2 = inputvec[1]

results = map(mult, v1, v2)

# Simply printing the results variable would print the tuple. This
# would be fine except that the STDIN of reducer.py takes all the
# output as input, including brackets, which can be problematic.
# Cleaning the output ready to be input for the Reduce step:
for o in results:
    print ' %s' % o,
reducer.py:

import sys

result = int()
for a in sys.stdin:
    a = a.strip()
    a = a.split()
    for r in range(len(a)):
        result += int(a[r])

print result
In the in subdirectory I have v1 containing 5,12,20 and v2 containing 14,11,3.
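For these inputs the expected result is just the dot product: 5*14 + 12*11 + 20*3 = 70 + 132 + 60 = 262.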
Testing locally, things work as expected:
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py
70 132 60
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort
70 132 60
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort | python ./reducer.py
262
When I run it in Hadoop, it appears to run successfully and doesn't throw any exceptions:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper python /home/hduser/VectMult3/mapper.py -reducer python /home/hduser/VectMult3/reducer.py -input /home/hduser/VectMult3/in -output /home/hduser/VectMult3/out4
Warning: $HADOOP_HOME is deprecated.
packageJobJar: [/app/hadoop/tmp/hadoop-unjar2168776605822419867/] [] /tmp/streamjob6920304075078514767.jar tmpDir=null
12/11/18 21:20:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/18 21:20:09 WARN snappy.LoadSnappy: Snappy native library not loaded
12/11/18 21:20:09 INFO mapred.FileInputFormat: Total input paths to process : 2
12/11/18 21:20:09 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
12/11/18 21:20:09 INFO streaming.StreamJob: Running job: job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: To kill this job, run:
12/11/18 21:20:09 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201211181903_0009
12/11/18 21:20:10 INFO streaming.StreamJob: map 0% reduce 0%
12/11/18 21:20:24 INFO streaming.StreamJob: map 67% reduce 0%
12/11/18 21:20:33 INFO streaming.StreamJob: map 100% reduce 0%
12/11/18 21:20:36 INFO streaming.StreamJob: map 100% reduce 22%
12/11/18 21:20:45 INFO streaming.StreamJob: map 100% reduce 100%
12/11/18 21:20:51 INFO streaming.StreamJob: Job complete: job_201211181903_0009
12/11/18 21:20:51 INFO streaming.StreamJob: Output: /home/hduser/VectMult3/out4
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /home/hduser/VectMult3/out4/part-00000
Warning: $HADOOP_HOME is deprecated.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /home/hduser/VectMult3/out4/
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 hduser supergroup 0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_SUCCESS
drwxr-xr-x - hduser supergroup 0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_logs
-rw-r--r-- 1 hduser supergroup 0 2012-11-18 22:05 /home/hduser/VectMult3/out4/part-00000
But when I check the output, all I find is an empty, 0-byte file.
I can't work out what's gone wrong. Can anyone help?
Edit: Response to @DiJuMx
> One way to fix this would be to output to a temporary file from map, and then use the temporary file in reduce.
I'm not sure Hadoop allows this; hopefully someone who knows better can correct me on this.
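If I understand the suggestion correctly, it would mean something like the sketch below (this is not the code I'm actually running, and the path /tmp/products.txt is just a placeholder I've made up). It is also why I'm doubtful: on a multi-node cluster the map and reduce tasks may run on different machines, so they wouldn't necessarily share a local filesystem.

# mapper.py (sketch of the temporary-file idea only):
# write the products to a local file instead of stdout.
import sys

def mult(x, y):
    return int(x) * int(y)

inputvec = tuple()
for i in sys.stdin:
    inputvec += (tuple(i.strip().split(",")),)

# /tmp/products.txt is a made-up path, purely for illustration
out = open('/tmp/products.txt', 'w')
for o in map(mult, inputvec[0], inputvec[1]):
    out.write('%s ' % o)
out.close()

# reducer.py (sketch only): read the products back from the same file,
# ignoring whatever arrives on stdin.
result = 0
for token in open('/tmp/products.txt').read().split():
    result += int(token)
print result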
> Before attempting this, try writing a simpler version which just passes the data straight through with no processing.
I thought this was a good idea, just to check that the data is flowing through correctly. I used the following for this:
Both mapper.py and reducer.py:

import sys

for i in sys.stdin:
    print i,
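This can be checked locally in the same way as before, e.g.:

cat in/* | python ./mapper.py | python ./reducer.py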
What comes out should be exactly what went in. In Hadoop, though, it still outputs an empty file.
> Alternatively, edit your existing code in reduce to output an (error) message to the output file if the input was blank.
mapper.py:

import sys

for i in sys.stdin:
    print "mapped",
print "mapper",
reducer.py:

import sys

for i in sys.stdin:
    print "reduced",
print "reducer",
If any input is received, it should ultimately output reduced. Either way, it should at least output reducer. The actual output is still an empty file.