I'm using Hadoop 2.7.1.
I'm really struggling to understand at what point in the streaming process sorts are applied, and how you can change the sort order and the separator. Reading the documentation has confused me further, since some config variables seem to be used interchangeably (cf. map.output.key.field.separator, used in the partitioner example, and mapreduce.map.output.key.field.separator, used in the comparator example).
My questions
- At what point in the process is each of these configuration variables used, and what exactly does each one affect (and is there a way I can tell just by looking at them)?
Additionally, what happens when they are used together? Do they override each other?
mapreduce.map.output.key.field.separator
map.output.key.field.separator
stream.map.output.field.separator
stream.num.map.output.key.fields
mapreduce.partition.keypartitioner.options
mapred.text.key.partitioner.options
mapreduce.fieldsel.data.field.separator
mapreduce.fieldsel.map.output.key.value.fields.spec
mapreduce.fieldsel.reduce.output.key.value.fields.spec
stream.reduce.input.field.separator
stream.num.reduce.input.fields
stream.reduce.output.field.separator
stream.num.reduce.output.fields
Searching the streaming documentation for each name shows the context in which it is introduced. Most are not contained in mapred-default.xml.
I suspect my confusion comes from the distinction between separating the records into keys and values, and separating the key into multiple fields.
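To make that distinction concrete, here is a minimal Python sketch of the two kinds of splitting as I read them from the streaming documentation. The function names `split_key_value` and `split_key_fields` are made up for illustration; the first is meant to mimic what a record separator plus a key-field count (e.g. stream.map.output.field.separator and stream.num.map.output.key.fields) would do, and the second what a key field separator would do. Which config variable drives which step in Hadoop itself is exactly what I'm unsure about, so treat the mapping as an assumption.

```python
def split_key_value(line, sep="\t", num_key_fields=1):
    """Split one output line into (key, value).

    The key is the prefix up to the num_key_fields-th separator; the rest
    of the line (without that separator) is the value. If the line has
    fewer separators than that, the whole line is the key and the value
    is empty.
    """
    parts = line.split(sep)
    if len(parts) <= num_key_fields:
        return line, ""
    key = sep.join(parts[:num_key_fields])
    value = sep.join(parts[num_key_fields:])
    return key, value


def split_key_fields(key, key_sep="\t"):
    """Split an already-extracted key into its individual fields."""
    return key.split(key_sep)


# Step 1: record -> (key, value), using "." as the separator and 4 key fields.
key, value = split_key_value("11.12.1.2.rest", sep=".", num_key_fields=4)

# Step 2: key -> fields, which a partitioner or comparator could then pick from.
fields = split_key_fields(key, key_sep=".")
```

Here `key` is the first four dot-separated fields, `value` is whatever follows the fourth dot, and `fields` is the key broken up again so that something like "fields 1 and 2" can be selected.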
My understanding of the process (ignoring combiners for now):
- Input data splits are sent to the Mappers
- Mapper outputs are split into (key, value) pairs and sent to the Partitioner
- The Partitioner sends the Mapper outputs to the Comparator based on the key (and the fields within the key?)
- The Comparator sorts each partition and sends it to a Reducer
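The flow above can be sketched as a toy model in plain Python. This is only my mental model written down, not how Hadoop implements it: `partition` and `run_shuffle` are hypothetical names, the CRC32 hash is a stand-in for whatever hash the real partitioner uses, and partitioning on the first two key fields is just an example choice.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers, key_sep=".", num_fields=2):
    """Pick a reducer index from the first num_fields fields of the key."""
    prefix = key_sep.join(key.split(key_sep)[:num_fields])
    # crc32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


def run_shuffle(pairs, num_reducers, sort_key):
    """Group (key, value) pairs by partition, then sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    # sort_key plays the role of the comparator: it maps a key to a sortable value.
    return {
        index: sorted(records, key=lambda kv: sort_key(kv[0]))
        for index, records in partitions.items()
    }


# Mapper output that has already been split into (key, value) pairs.
mapper_output = [
    ("11.12.1.2", "a"),
    ("11.14.2.3", "b"),
    ("11.12.1.1", "c"),
    ("11.14.2.2", "d"),
]

# Compare keys field by field (as strings); each partition would go to one reducer.
shuffled = run_shuffle(mapper_output, num_reducers=3, sort_key=lambda k: k.split("."))
```

In this model, the partition step only decides which reducer a record goes to, and the sort step only decides the order within that reducer's input, so keys sharing their first two fields always end up together regardless of how they are sorted.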