I am new to Amazon EMR, and I am trying to understand how the sort phase between the map and reduce phases works, and whether I can control it (for example, by supplying my own compare function).
If you know what the output of the map phase needs to look like, that would be most helpful.
Currently I have a simple map phase that prints lines in this format:
"keyA|keyB|valueA1|valueA2"
My reducer function receives these lines and merges them into:
"keyA|keyB|sum_valueA1|sum_valueA2"
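To make the intended merge concrete, here is a minimal sketch of that reduce step in Python. It is not my actual reducer: the function names `parse` and `merge` are hypothetical, the values are assumed to be integers, and the input is assumed to arrive sorted so that lines with the same keyA|keyB pair are adjacent.

```python
from itertools import groupby


def parse(line):
    # Split "keyA|keyB|valueA1|valueA2" into a key tuple and two values.
    # Assumption: exactly four fields, and both values are integers.
    key_a, key_b, value_1, value_2 = line.rstrip("\n").split("|")
    return (key_a, key_b), int(value_1), int(value_2)


def merge(lines):
    # Assumption: lines with the same (keyA, keyB) pair are adjacent,
    # which is what the sort phase is expected to guarantee.
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda record: record[0]):
        sum_1 = 0
        sum_2 = 0
        for _, value_1, value_2 in group:
            sum_1 += value_1
            sum_2 += value_2
        yield "|".join([key[0], key[1], str(sum_1), str(sum_2)])


# Small in-memory demonstration with made-up sample lines.
sample = ["keyA|keyB|1|2", "keyA|keyB|3|4", "keyA|keyC|5|6"]
for merged in merge(sample):
    print(merged)
```

In a real streaming job the same `merge` function would be fed `sys.stdin` instead of the sample list.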
The problem is that in the reduce phase I only get lines that are completely identical, meaning that both the keys and their values are identical. That is a problem, because it doesn't allow me to use the full power of map-reduce.
I saw that they are using this format in their wordcount example:
"LongValueSum:key\t1".
Do I have to use the word "LongValueSum" and the tab for the text before the tab to be identified as the key, so that the sort does not take the value into account? Using the tab is a bit of a problem because the key itself could contain "\t".
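For concreteness, here is a minimal sketch of what tab-delimited mapper output could look like for my records. It assumes Hadoop streaming's documented default of treating everything before the first tab on a line as the key and the rest as the value. The helper names `escape_tabs`, `to_record` and `split_record` are hypothetical, the "LongValueSum:" prefix is left out, and escaping tabs inside the key is only one possible workaround, not something the wordcount example does.

```python
def escape_tabs(text):
    # Escape backslashes first so the encoding stays reversible,
    # then replace each real tab with the two characters "\" and "t".
    return text.replace("\\", "\\\\").replace("\t", "\\t")


def to_record(key_a, key_b, value_a1, value_a2):
    # Key fields go before the single tab, value fields after it.
    key = "|".join(escape_tabs(part) for part in (key_a, key_b))
    value = "|".join(str(part) for part in (value_a1, value_a2))
    return key + "\t" + value


def split_record(line):
    # Split at the first tab only, mirroring the assumed default behaviour.
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


# Made-up sample record: the key part and the value part are separated by one tab.
print(to_record("keyA", "keyB", 1, 2))
```

With this layout, two lines with the same keyA|keyB but different values would share a key, which is the grouping I am after.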
Please help.