
I'm using Hadoop 2.7.1.

I'm really struggling to understand at what point in the streaming process sorts are applied, and how to change the sort order and the separator. Reading the documentation has confused me further, since some config variables seem to be used interchangeably (cf. map.output.key.field.separator, used in the partitioner example, and mapreduce.map.output.key.field.separator, used in the comparator example).

My questions

  1. At what point in the process is each of these configuration variables used, and what exactly does it affect? (Is there a way to tell just by looking at the name?)
  2. Additionally, what happens when they are used together? Do they override each other?

    • mapreduce.map.output.key.field.separator
    • map.output.key.field.separator
    • stream.map.output.field.separator
    • stream.num.map.output.key.fields
    • mapreduce.partition.keypartitioner.options
    • mapred.text.key.partitioner.options
    • mapreduce.fieldsel.data.field.separator
    • mapreduce.fieldsel.map.output.key.value.fields.spec
    • mapreduce.fieldsel.reduce.output.key.value.fields.spec
    • stream.reduce.input.field.separator
    • stream.num.reduce.input.fields
    • stream.reduce.output.field.separator
    • stream.num.reduce.output.fields

You can search the streaming documentation to see the context in which each one is introduced. Most are not listed in mapred-default.xml.

I suspect my confusion comes from the distinction between separating each record into a key and a value, and separating the key itself into multiple fields.
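To make that distinction concrete, here is a minimal sketch in plain Python (not Hadoop code). The function names are hypothetical, and the behaviour is only my reading of the streaming documentation: the first function mimics what I assume stream.map.output.field.separator and stream.num.map.output.key.fields do to a line, and the second mimics what I assume a key field separator does to the key afterwards.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one output line into (key, value).

    Assumption: the key is everything up to the n-th separator and the
    value is the rest; with too few separators the whole line is the key.
    """
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


def split_key_fields(key, key_field_separator="."):
    """Split an already-extracted key into fields (hypothetical helper)."""
    return key.split(key_field_separator)


# Step 1: record -> (key, value), using a tab and one key field.
key, value = split_key_value("11.12.1.2\tsome value")
# Step 2: key -> fields, using "." inside the key only.
fields = split_key_fields(key)
print(key, value, fields)
```

If this reading is right, the two separators never interact: one decides where the key ends, the other only matters once the key has been isolated.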

My understanding of the process (ignoring combiners for now):

  1. Input data splits are sent to the Mappers.
  2. Mapper outputs are split into (key, value) pairs and sent to the Partitioner.
  3. The Partitioner sends the Mapper outputs to the Comparator based on the key (and the fields within the key?).
  4. The Comparator sorts each partition and sends it to a Reducer.
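The steps above can be sketched as a toy simulation in plain Python. This is only my mental model, not how Hadoop is implemented: `partition_index` and `shuffle` are hypothetical names, `zlib.crc32` is a stand-in for whatever hash the real partitioner uses, and I am assuming the partitioner looks at the first two "."-separated fields of the key while the sort compares the whole key as text.

```python
import zlib
from collections import defaultdict


def partition_index(key, num_reducers, key_field_separator=".",
                    num_partition_fields=2):
    """Pick a reducer from the first few fields of the key.

    Assumption: only the leading key fields decide the partition.
    crc32 is a deterministic stand-in hash, not Hadoop's own.
    """
    fields = key.split(key_field_separator)
    prefix = key_field_separator.join(fields[:num_partition_fields])
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group (key, value) pairs by partition, then sort each group by key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_index(key, num_reducers)].append((key, value))
    # Sorting happens per partition, on the whole key, as plain text.
    return {
        index: sorted(records, key=lambda record: record[0])
        for index, records in partitions.items()
    }


mapper_output = [
    ("11.12.1.2", ""),
    ("11.14.2.3", ""),
    ("11.11.4.1", ""),
    ("11.12.1.1", ""),
    ("11.14.2.2", ""),
]
for index, records in sorted(shuffle(mapper_output, 3).items()):
    print(index, [key for key, _ in records])
```

In this model, keys sharing the first two fields always reach the same reducer, and each reducer sees its keys in sorted order. Whether the real partitioner and comparator read the same separator and field settings is exactly what I cannot work out from the documentation.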
James Owers
  • What do you mean by *fields within the key*? – OneCricketeer Dec 03 '15 at 22:24
  • There are variables called `...key.field.separator` and `...output.field.separator`. I presume that the [partitioner example](https://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html#Hadoop_Partitioner_Class) has outputs from the mapper like this: `11.12.1.2` --> {[`11`,`12`,`1`,`2`], `None`} i.e. keys with 4 fields and no value. – James Owers Dec 03 '15 at 22:32
