
I have a file with two columns, id and timestamp. I'm trying to count the number of sessions each id has, where a session ends after more than 30 minutes of inactivity. However, I'm having trouble with the streaming command. A few example rows are as follows.

id,time
1,2015-02-05 01:01:01
1,2015-02-05 01:02:01
3,2015-02-05 02:01:01
3,2015-02-05 02:01:02

I know my mapper and reducer work correctly because I get the correct results when I use only one reducer. My problem arises when I need more than one reducer: I am trying to use the partitioner to send all records with the same first field of the map output (id) to the same reducer, sorted by the second field (time). Any suggestions on how to accomplish this?

This is what I'm trying.

hadoop jar /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p470.103/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.2.jar \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D stream.map.output.field.separator=, \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-Dmapred.text.key.comparator.options=-k2,2 \
-input /in/ \
-output /out/  \
-mapper mapper1.py \
-file ${DIR}mapper.py \
-reducer reducerA.py \
-file ${DIR}reducer.py
cloud36
  • Yes you can. But what is your specific problem? Values coming to reducers not properly sorted or partitioned? – yurgis Feb 18 '15 at 19:13

1 Answer


Change "-Dmapred.text.key.comparator.options=-k2,2" to "-Dmapred.text.key.comparator.options=-k1,2" so that the records a reducer receives are sorted first by id and then by time. Also, your reducer needs to compare the ids of successive records and count sessions only within a run of records that share the same id, because with more than one reducer a single reducer can still receive several different ids.
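As an illustration of the second point, here is a minimal sketch of a reducer that compares successive ids. It is not the asker's reducer.py; the names count_sessions, main, SESSION_GAP and TIME_FORMAT are made up for this example. It assumes each input line starts with "id,time" (optionally followed by a tab and an empty value), that the lines arrive sorted by id and then by time, and that the "id,time" header row was already dropped by the mapper.

```python
import sys
from datetime import datetime, timedelta

# Assumption: a new session starts after more than 30 minutes of inactivity.
SESSION_GAP = timedelta(minutes=30)
# Assumption: timestamps look like "2015-02-05 01:01:01".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def count_sessions(lines):
    """Yield (id, session_count) for input sorted by id, then by time."""
    current_id = None
    last_time = None
    sessions = 0
    for line in lines:
        line = line.strip()  # drops the trailing tab/newline of an empty value
        if not line:
            continue
        record_id, stamp = line.split(",", 1)
        when = datetime.strptime(stamp.strip(), TIME_FORMAT)
        if record_id != current_id:
            # The id changed: emit the finished id and start counting afresh.
            if current_id is not None:
                yield current_id, sessions
            current_id = record_id
            sessions = 1
        elif when - last_time > SESSION_GAP:
            # Same id, but the gap is longer than 30 minutes: a new session.
            sessions += 1
        last_time = when
    # Emit the last id seen by this reducer.
    if current_id is not None:
        yield current_id, sessions


def main(stream=sys.stdin, out=sys.stdout):
    # Write one "id<TAB>count" line per id.
    for record_id, sessions in count_sessions(stream):
        out.write("%s\t%d\n" % (record_id, sessions))
```

To use it as a streaming reducer, call main() at the bottom of the script. Because the count is reset whenever the id changes, it does not matter how many ids end up on the same reducer, as long as all records for a given id go to one reducer and arrive in time order.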

Jeff Kubina
  • 800
  • 4
  • 15