Amazon EMR sorting

Question

I am new to Amazon EMR, and I am trying to understand how the sorting phase after the map phase (and before the reduce phase) works, and whether I can manipulate it, for example by supplying my own compare function.

If you know what the output from the map phase needs to look like, that would be most helpful.

Currently I have a simple map phase that prints its output in this format:

"keyA|keyB|valueA1|valueA2"

My reducer function receives these lines and merges them into:

"keyA|keyB|sum_valueA1|sum_valueA2"

The problem is that in the reduce phase I only get lines grouped together when they are completely identical, meaning that both the keys and their values match. That prevents me from using the full power of map-reduce.

I saw that they are using this format in their wordcount example:

"LongValueSum:key\t1".

Do I have to use the word "LongValueSum" and the tab for the first part to be identified as a key, so that the sort does not take the value into account? Using the tab is a bit of a problem because the "key" could contain "\t".

Please help.

score 1 · Answer 1 · answered Oct 26 '15 at 20:54

Found the answer.
It was buried deep in the Hadoop manual; something this basic should be in the "Getting started" section.

Putting it here in the hope that it saves future developers some time:

From: http://hadoop.apache.org/docs/r1.2.1/streaming.html

Hadoop Partitioner Class

Hadoop has a library class, KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are as explained in previous example.
The two variables are used by streaming to identify the key/value pair of mapper.
The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the -D mapred.text.key.partitioner.options=-k1,2 option.
Here, -D map.output.key.field.separator=. specifies the separator for the partition.
This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.
This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary.
The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.
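
To make the partition-versus-sort distinction concrete, here is a small local simulation in Python. It is only a sketch of the behaviour described above, not Hadoop's implementation: the constant names, the `shuffle` helper, the reducer count of 3, and the use of `zlib.crc32` as the hash are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

SEP = "."            # stands in for stream.map.output.field.separator
NUM_KEY_FIELDS = 4   # stands in for stream.num.map.output.key.fields
NUM_REDUCERS = 3     # hypothetical; the quoted command uses 12


def split_key_value(line):
    """Treat the first NUM_KEY_FIELDS fields as the key, the rest as the value."""
    fields = line.split(SEP)
    key = SEP.join(fields[:NUM_KEY_FIELDS])
    value = SEP.join(fields[NUM_KEY_FIELDS:])
    return key, value


def partition(key):
    """Pick a reducer from the first two key fields only (like -k1,2)."""
    primary = SEP.join(key.split(SEP)[:2])
    # crc32 is an illustrative stand-in, not Hadoop's actual hash function.
    return zlib.crc32(primary.encode("utf-8")) % NUM_REDUCERS


def shuffle(lines):
    """Group records by reducer, then sort each group by the full key."""
    buckets = defaultdict(list)
    for line in lines:
        key, value = split_key_value(line)
        buckets[partition(key)].append((key, value))
    return {r: sorted(recs, key=lambda kv: kv[0]) for r, recs in buckets.items()}


records = ["11.12.1.2.x", "11.14.2.3.y", "11.12.1.1.z", "11.14.2.2.w"]
for reducer, recs in sorted(shuffle(records).items()):
    print(reducer, recs)
```

Records that share the first two fields (for example `11.12`) always end up in the same bucket, and within a bucket they are ordered by the whole four-field key. For the "keyA|keyB|valueA1|valueA2" format from the question, the same idea would presumably apply with "|" as the separator and two key fields.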
