
I have written a mapper function that parses the XML and outputs the result as columns separated by "\t" as shown below

Name  Age
ABC   23
XYZ   24
ERT   25
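(For context, a minimal streaming mapper along these lines might look as follows. The actual csv_parser_part.py is not shown in the question, so the `<NAME>`/`<AGE>` tag names here are assumptions, purely for illustration.)

```python
#!/usr/bin/env python
# Illustrative sketch only: the real csv_parser_part.py is not shown,
# so the XML tag names below are assumed for the sake of the example.
import re
import sys

def parse_record(record):
    # Pull one name/age pair out of a single XML record;
    # returns None when either field is missing.
    name = re.search(r"<NAME>(.*?)</NAME>", record)
    age = re.search(r"<AGE>(.*?)</AGE>", record)
    if name and age:
        return name.group(1), age.group(1)
    return None

def main():
    for line in sys.stdin:
        parsed = parse_record(line)
        if parsed:
            # Tab-separated output: field 1 becomes the partitioning key.
            print("%s\t%s" % parsed)

if __name__ == "__main__":
    main()
```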

Using the Hadoop Streaming command shown below, I am trying to partition the data on the basis of the key. I want a different output folder for each key, i.e.

instead of getting part-00000, part-00001 and part-00003 as the output files in the output folder "out_parse_part16", I want the folders to be created like this
/out_parse_part16/ABC,
/out_parse_part16/ABC,

/out_parse_part16/XYZ,

/out_parse_part16/ERT

I want the data to be partitioned into different folders named after the keys. Is there any way to do that by creating output folders in my reducer code based on the keys?
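(For context on the reducer side: a streaming reducer receives its input sorted by key, so it can group consecutive lines, but it cannot by itself decide the output directory layout; that is the job of the `OutputFormat`. A sketch of the standard grouping idiom, since csv_part_reducer.py is not shown, might be:)

```python
#!/usr/bin/env python
# Sketch of the grouping idiom in a streaming reducer. The real
# csv_part_reducer.py is not shown; this only illustrates that the
# reducer sees key-sorted lines -- it does not create output folders.
import sys
from itertools import groupby

def records(stream):
    # Split each "key\tvalue" line into a (key, value) pair.
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

def main():
    # Input is sorted by key, so groupby collects each key's values.
    for key, group in groupby(records(sys.stdin), key=lambda kv: kv[0]):
        for _, value in group:
            # Re-emit key\tvalue; the per-key directory layout has to
            # come from the OutputFormat, not from code here.
            print("%s\t%s" % (key, value))

if __name__ == "__main__":
    main()
```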

/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/bin/hadoop jar /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/jars/hadoop-streaming-2.6.0-cdh5.5.2.jar \
-Dstream.recordreader.begin="<START_REC>" \
-Dstream.recordreader.end="</START_REC>" \
-D mapred.job.name="parse_with_partition"  \
-D stream.num.map.output.key.fields=1 \
-D map.output.key.field.separator=\t \
-D mapred.text.key.partitioner.options="-k1nr" \
-inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,begin =<START_REC>,end=</START_REC>" \
-file csv_parser_part.py \
-mapper csv_parser_part.py \
-file csv_part_reducer.py \
-reducer /csv_part_reducer.py \
-input TEST_XML2.xml \
-output out_parse_part16 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-verbose
  • Have you looked at this answer: http://stackoverflow.com/questions/18541503/multiple-output-files-for-hadoop-streaming-with-python-mapper – Binary Nerd Jun 13 '16 at 11:04
  • Hey. Thanks for looking into this. I have been through the link that you shared. The issue is that I am storing my output as Avro and using -outputformat org.apache.avro.mapred.AvroTextOutputFormat, which I didn't mention in my question. Apologies for that. Is there any way I can do it using my reducer code? – Rohit Guglani Jun 13 '16 at 11:12
  • As with the example above, you would need to extend the `AvroTextOutputFormat` to control what the files are called. That `OutputFormat` extends `FileOutputFormat`, so you would need to look at which methods to override in that class. It might be possible. – Binary Nerd Jun 13 '16 at 12:07

1 Answer


I think you need a custom output format like this, packaged into a jar and passed to the streaming job (via -libjars, with the class name given to -outputformat).

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class TestMultipleOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Prefix the default part-file name with the key, so each key's
        // records land in their own subdirectory under the output folder,
        // e.g. ABC/part-00000, XYZ/part-00000, ERT/part-00000.
        return key.toString() + "/" + name;
    }
}
– zhimu