I have written a mapper function that parses the XML and outputs the result as tab-separated ("\t") columns, as shown below:
Name Age
ABC 23
XYZ 24
ERT 25
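For context, a minimal sketch of what such a mapper might look like. The actual csv_parser_part.py and the XML schema are not shown here, so the child element names `Name` and `Age`, the helper `record_to_row`, and the assumption that each record arrives on stdin wrapped in `<START_REC>...</START_REC>` are all illustrative guesses:

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical field names; the real XML schema is not shown in this post.
FIELDS = ("Name", "Age")


def record_to_row(xml_text):
    """Turn one <START_REC>...</START_REC> record into a tab-separated row."""
    root = ET.fromstring(xml_text.strip())
    # findtext returns the default ("") when a child element is missing.
    values = [(root.findtext(field, default="") or "").strip() for field in FIELDS]
    return "\t".join(values)


def main(stream=sys.stdin, out=sys.stdout):
    """Buffer stdin until a record's closing tag, then emit one row per record."""
    buffer = []
    for line in stream:
        buffer.append(line)
        if "</START_REC>" in line:
            out.write(record_to_row("".join(buffer)) + "\n")
            buffer = []
```

In a real streaming script, `main()` would be called at the bottom of the file; the first column (`Name`) is what the job then treats as the key.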
Using the Hadoop Streaming command shown below, I am trying to partition the data by key. I want a separate output folder for each key, i.e.
instead of getting part-00000, part-00001 and part-00003 as the output files in the output folder "out_parse_part16",
I want the folders to be created like this:
/out_parse_part16/ABC,
/out_parse_part16/XYZ,
/out_parse_part16/ERT
I want the data to be partitioned into different folders named after the keys. Is there any way to do that by creating the output folders in my reducer code, based on the keys?
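To make the target layout concrete, here is a minimal sketch of the grouping being asked for. It only computes which rows would go under which path; it does not write to HDFS, and it is not a confirmed way to get this layout out of a streaming reducer. The function name `split_by_key` and the per-folder file name `part-00000` are assumptions for illustration:

```python
import posixpath
from itertools import groupby


def split_by_key(lines, output_root):
    """Group tab-separated rows by their first column and map each key
    to a file inside its own sub-folder of output_root.

    Assumes the rows arrive sorted by key, as reducer input does.
    """
    rows = (line.rstrip("\n") for line in lines if line.strip())
    layout = {}
    for key, group in groupby(rows, key=lambda row: row.split("\t", 1)[0]):
        # "part-00000" is a hypothetical file name inside each key folder.
        path = posixpath.join(output_root, key, "part-00000")
        layout.setdefault(path, []).extend(group)
    return layout
```

For the sample rows above, calling `split_by_key(rows, "/out_parse_part16")` yields one entry per key, e.g. `/out_parse_part16/ABC/part-00000` holding the `ABC` rows.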
/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/bin/hadoop jar /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/jars/hadoop-streaming-2.6.0-cdh5.5.2.jar \
-Dstream.recordreader.begin="<START_REC>" \
-Dstream.recordreader.end="</START_REC>" \
-D mapred.job.name="parse_with_partition" \
-D stream.num.map.output.key.fields=1 \
-D map.output.key.field.separator=\t \
-D mapred.text.key.partitioner.options="-k1nr" \
-inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,begin=<START_REC>,end=</START_REC>" \
-file csv_parser_part.py \
-mapper csv_parser_part.py \
-file csv_part_reducer.py \
-reducer csv_part_reducer.py \
-input TEST_XML2.xml \
-output out_parse_part16 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-verbose