I'm very new to Hadoop Streaming and am having some difficulty with partitioning.
Depending on what it finds in a line, my mapper function returns either
key1, 0, somegeneralvalues # some kind of "header" line where linetype = 0
or
key1, 1, value1, value2, othervalues... # "data" line, different values, linetype = 1
To reduce properly, I need to group all lines that share the same key1 and, within each group, sort them by linetype (0 or 1), then value1, then value2, something like:
1 0 foo bar... # header first
1 1 888 999.... # data line, with lower value1
1 1 999 111.... # a few data lines may follow, sorted by value1, then value2
------------ #possible partition here, and only here in this example
2 0 baz foobar....
2 1 123 888...
2 1 123 999...
2 1 456 111...
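To make the expected ordering concrete, here is a minimal Python sketch of what I want the shuffle to do. It is only an illustration, not my actual mapper or reducer: the sample records are taken from the example above, and I'm assuming tab-separated fields compared as plain text (not numerically).

```python
from itertools import groupby

# Hypothetical mapper output, deliberately shuffled.
# Assumption: fields are tab-separated (key1, linetype, then the values).
lines = [
    "2\t1\t456\t111",
    "1\t1\t999\t111",
    "2\t0\tbaz\tfoobar",
    "1\t1\t888\t999",
    "2\t1\t123\t999",
    "1\t0\tfoo\tbar",
    "2\t1\t123\t888",
]

def sort_key(line):
    # Compare on the first four fields, as plain strings (like a text sort).
    return tuple(line.split("\t")[:4])

def key1_of(line):
    # Only the first field decides which group a line belongs to.
    return line.split("\t")[0]

ordered = sorted(lines, key=sort_key)

# Each group is what the reducer should see for one key1, header first.
groups = {key: list(group) for key, group in groupby(ordered, key=key1_of)}

for key, group in groups.items():
    print("key1 =", key)
    for line in group:
        print("  ", line.replace("\t", " "))
```

The point is that sorting uses four fields while grouping uses only the first one.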
Is there a way to ensure such partitioning? So far I've tried playing with options such as
-partitioner,'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
-D stream.num.map.output.key.fields=4 # please use 4 fields to sort data
-D mapred.text.key.partitioner.options=-k1,1 # please make partitions based on first key
or alternatively
-D num.key.fields.for.partition=1 # Seriously, please group by key1 !
which so far has only brought rage and despair.
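For clarity, this is the behaviour I expect from those options, written as a rough Python sketch. It is not Hadoop's implementation: `zlib.crc32` is just a stand-in for whatever hash the partitioner really uses, `NUM_REDUCERS` is a made-up value, and I'm again assuming tab-separated fields.

```python
import zlib

NUM_REDUCERS = 3  # hypothetical number of reduce tasks

def partition(line, num_reducers=NUM_REDUCERS):
    # Only the first tab-separated field (key1) decides the reducer.
    # crc32 is a stand-in hash, not what Hadoop actually uses.
    key1 = line.split("\t")[0]
    return zlib.crc32(key1.encode("utf-8")) % num_reducers

def sort_key(line):
    # Sorting, unlike partitioning, looks at the first four fields.
    return tuple(line.split("\t")[:4])

# Sample records taken from the example above, in arbitrary order.
lines = [
    "2\t1\t456\t111",
    "1\t1\t999\t111",
    "2\t0\tbaz\tfoobar",
    "1\t1\t888\t999",
    "2\t1\t123\t999",
    "1\t0\tfoo\tbar",
    "2\t1\t123\t888",
]

# Assign every record to a reducer based on key1 alone.
buckets = {}
for line in lines:
    buckets.setdefault(partition(line), []).append(line)

# Within each reducer, records arrive sorted on the four key fields.
for records in buckets.values():
    records.sort(key=sort_key)

for reducer, records in sorted(buckets.items()):
    print("reducer", reducer)
    for line in records:
        print("  ", line.replace("\t", " "))
```

In other words, all lines with the same key1 should end up on the same reducer, with the header line first and the data lines after it in sorted order.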
In case it matters: my scripts work properly when I run cat data | mapper | sort | reduce. I'm using the Amazon Elastic MapReduce Ruby client, so I'm passing the options with
--arg '-D','options' for the ruby script.
Any help would be highly appreciated! Thanks in advance.