I have a file with two columns, id and timestamp. I'm counting the number of sessions each id has, where a new session starts after more than 30 minutes of inactivity. However, I'm having trouble with the Hadoop streaming command. A few example rows are as follows.
id,time
1,2015-02-05 01:01:01
1,2015-02-05 01:02:01
3,2015-02-05 02:01:01
3,2015-02-05 02:01:02
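To make the session rule concrete, here is a minimal Python sketch of the counting logic. It is not the actual reducer from this question: the function name `count_sessions` and the `SESSION_GAP` constant are hypothetical, and it assumes the timestamps for each id arrive in ascending order, which is what the sort is meant to guarantee.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity threshold from the description


def count_sessions(rows):
    """Count sessions per id from (id, timestamp-string) pairs.

    Assumes the timestamps for each id arrive in ascending order.
    """
    counts = {}
    last_seen = {}
    for key, stamp in rows:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        prev = last_seen.get(key)
        # A new session starts on the first event for an id,
        # or when the gap since the previous event exceeds 30 minutes.
        if prev is None or t - prev > SESSION_GAP:
            counts[key] = counts.get(key, 0) + 1
        last_seen[key] = t
    return counts
```

If the rows for one id were split across reducers, or arrived out of time order, a count like this would come out wrong, which is why the partitioning and sorting matter.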
I know my mapper and reducer work correctly because I get the correct results when I use only one reducer. My problem comes when I need more than one reducer: I'm trying to use the partitioner to send all map output records with the same first field (id) to the same reducer, sorted by the second field (time). Any suggestions on how to accomplish this?
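The intended routing can be mimicked locally in Python. This is only an illustration of the desired result, not how Hadoop implements partitioning: the function name `route` is hypothetical, and `zlib.crc32` is a stand-in hash chosen because it is deterministic.

```python
import zlib


def route(records, num_reducers):
    """Split (id, time) records into per-reducer lists, each sorted by id then time."""
    partitions = [[] for _ in range(num_reducers)]
    for key, stamp in records:
        # Partition on the id only, so every row for an id lands on one reducer.
        # crc32 is an assumed stand-in for the real partitioner's hash.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index].append((key, stamp))
    for part in partitions:
        # "YYYY-MM-DD HH:MM:SS" timestamps sort chronologically as plain strings.
        part.sort()
    return partitions
```

Each list returned by `route` stands for the input one reducer would see: complete per id, and in time order within each id.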
This is what I'm trying.
hadoop jar /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p470.103/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.2.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D stream.map.output.field.separator=, \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D mapred.text.key.comparator.options=-k2,2 \
-input /in/ \
-output /out/ \
-mapper mapper1.py \
-file ${DIR}mapper.py \
-reducer reducerA.py \
-file ${DIR}reducer.py