
When I run a MapReduce job on a local setup, the reducer produces the expected output, but the same code on EMR produces none. The cluster has 1 master node and 10 core nodes.

This is the job output. No error is displayed:

Map-Reduce Framework
    Map input records=3000
    Map output records=378
    Map output bytes=36054
    Map output materialized bytes=40448
    Input split bytes=1420
    Combine input records=0
    Combine output records=0
    Reduce input groups=179
    Reduce shuffle bytes=40448
    Reduce input records=378
    Reduce output records=0
    Spilled Records=756
    Shuffled Maps =380
    Failed Shuffles=0
    Merged Map outputs=380
    GC time elapsed (ms)=23484
    CPU time spent (ms)=125780
    Physical memory (bytes) snapshot=9989242880
    Virtual memory (bytes) snapshot=52768247808
    Total committed heap usage (bytes)=6517702656
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters 
    Bytes Read=711180681
File Output Format Counters 
    Bytes Written=0

The reducer code follows:

def reducer(self, key, val):
    best = -60
    best_name = None
    lat = 0
    longi = 0
    yr = 0
    genre = None

    for hot, name,lat,longi,yr,genre in val:
        if hot > best:
            best = hot
            best_name = name
            lat = lat
            longi = longi
            yr = yr
            genre = genre

    yield (key,(best,best_name,lat,longi,yr,genre))
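One thing to note about the reducer above: the loop variables `lat`, `longi`, `yr` and `genre` reuse the names of the outer variables, so assignments such as `lat = lat` do nothing and the final `yield` reports those fields from the last record iterated rather than from the hottest one. This would not by itself explain `Reduce output records=0`, since the `yield` still runs once per key. A minimal sketch of the intended logic, written as a plain function (no `self`) so it can be run without mrjob, and assuming each value is a 6-item sequence in the order shown above:

```python
def reducer(key, val):
    """Yield the record with the highest hotness for a key."""
    # Same sentinel and defaults as the original code: hotness -60, no name,
    # zero coordinates and year, no genre.
    best = (-60, None, 0, 0, 0, None)

    for record in val:
        hot = record[0]
        if hot > best[0]:
            # Keep the whole record so every field belongs to the same row.
            # tuple() also covers values that arrive as lists after decoding.
            best = tuple(record)

    yield key, best


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    rows = [
        (0.5, "a", 1.0, 2.0, 2000, "rock"),
        (0.9, "b", 3.0, 4.0, 2001, "pop"),
        (0.1, "c", 5.0, 6.0, 2002, "jazz"),
    ]
    for out in reducer("US", rows):
        print(out)
```

Holding the best record as a single tuple avoids the name clash entirely, because no field is tracked in a separate variable.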
Mikel Urkia
MUKUND
  • Could you please add the code where you indicate where to save the data on EMR? – Mikel Urkia Oct 08 '14 at 14:04
  • You do have some output, as shown by `Map output bytes=36054`. Can you please provide the exact command with which you invoke the MRJob job on EMR? – alko Oct 08 '14 at 15:34
  • In the above output, `Reduce output records=0`. There is no guarantee that a program which produces output in standalone (local) mode will produce the same output in pseudo-distributed or distributed mode. – Mr.Chowdary Oct 10 '14 at 09:15
  • Thank you all for the feedback. Mikel, we did that as part of the step via the console and also passed the output location, an S3 bucket, as a command-line argument: `hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -files s3://millionmusic/prog/artisthotnessbycountry.py -mapper artisthotnessbycountry.py -reducer artisthotnessbycountry.py -input s3n://tbmmsd/*.tsv.* -output s3://millionmusic/artisthotnessbycountry/` – MUKUND Oct 10 '14 at 14:09
  • Alko, yes, the mapper and combiner outputs are fine. I have mentioned the command in my comment above to Mikel: `/home/hadoop/contrib/streaming/hadoop-streaming.jar -files s3://millionmusic/prog/artisthotnessbycountry.py -mapper artisthotnessbycountry.py -reducer artisthotnessbycountry.py -input s3n://tbmmsd/A.tsv.a -output s3://millionmusic/artisthotnessbycountry/` – MUKUND Oct 10 '14 at 14:20
  • Chowdary, we noticed that the records reach the reducer as input, and the reducer is the simple code shown above, but the output is 0 records. We use the mrjob library and are not sure whether any configuration needs to be updated. As a note, we also added a combiner method, which executed successfully, but the reducer still produces no output. Is there any configuration or additional setup to be done? – MUKUND Oct 10 '14 at 15:16
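
One thing that may be worth checking, given the commands quoted in the comments: the same script is passed as both `-mapper` and `-reducer` with no arguments, so nothing tells it which task to run. The sketch below builds a streaming command that names the task explicitly. It assumes the script is an mrjob job that accepts mrjob's `--step-num`, `--mapper` and `--reducer` switches, that `python` and mrjob are available on the cluster nodes, and that the job has a single step; `streaming_command` is a hypothetical helper, not part of mrjob or Hadoop. Whether this is the cause of the empty output has not been verified.

```python
import shlex


def streaming_command(script, input_path, output_path, step_num=0):
    """Build a Hadoop streaming argv that tells the script which task to run."""
    # Assumption: the script is an mrjob job, so it understands
    # --step-num, --mapper and --reducer for running one task of one step.
    mapper = "python {} --step-num={} --mapper".format(script, step_num)
    reducer = "python {} --step-num={} --reducer".format(script, step_num)
    return [
        "hadoop", "jar", "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
        "-files", "s3://millionmusic/prog/" + script,
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


if __name__ == "__main__":
    argv = streaming_command(
        "artisthotnessbycountry.py",
        "s3n://tbmmsd/A.tsv.a",
        "s3://millionmusic/artisthotnessbycountry/",
    )
    # Print a shell-safe version of the command for copying into a step.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

An alternative that avoids hand-written streaming arguments is to let mrjob submit the job itself with its EMR runner (`-r emr`), which passes these switches to each task on its own.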

0 Answers