
I am working on a mapreduce project using Hadoop. I currently have 3 sequential jobs.

I want to use Hadoop counters, but the problem is that I want to make the actual count in the first job, but access the counter value in the reducer of the 3rd job.

How can I achieve this? Where should I define the enum? Do I need to pass it through the second job? It would also help to see a code example for this, as I couldn't find anything yet.

Note: I am using Hadoop 2.7.2

EDIT: I already tried the approach explained here and it didn't succeed. My case is different, as I want to access the counters from a different job (not from the mapper in the reducer of the same job).

What I tried to do: First Job:

public static void startFirstJob(String inputPath, String outputPath) throws IOException, ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    job.waitForCompletion(true);
}

Defined the counter enum in a different class:

public class CountersClass {
    public static enum N_COUNTERS {
        SOMECOUNT
    }
}

Trying to read counter:

Cluster cluster = new Cluster(context.getConfiguration());
// BUG: JobID.forName() expects a job ID string like "job_1468912345678_0001",
// not the job *name* "wordCount" -- so getJob() returns null here.
Job job = cluster.getJob(JobID.forName("wordCount"));
Counters counters = job.getCounters();
CountersClass.N_COUNTERS mycounter = CountersClass.N_COUNTERS.valueOf("SOMECOUNT");
Counter c1 = counters.findCounter(mycounter);
long N_Count = c1.getValue();
A. Sarid
  • Possible duplicate of [Is there a way to access number of successful map tasks from a reduce task in an MR job?](http://stackoverflow.com/questions/8009802/is-there-a-way-to-access-number-of-successful-map-tasks-from-a-reduce-task-in-an) – tworec Jul 13 '16 at 19:47
  • I think it's not a good idea to use counters from within reduce job. see http://stackoverflow.com/questions/8009802/is-there-a-way-to-access-number-of-successful-map-tasks-from-a-reduce-task-in-an/ – tworec Jul 13 '16 at 19:47
  • Yes, I saw this already and I tried this approach. But in that case he wants to get the counters inside the reducer (of the same job). It is not the same as in my case. – A. Sarid Jul 13 '16 at 19:55

2 Answers


The classic solution is to put the first job's counter value into the configuration of the subsequent job where you need to access it.

First, make sure the counter is incremented correctly in the counting job's mapper/reducer:

context.getCounter(CountersClass.N_COUNTERS.SOMECOUNT).increment(1);

Then after counting job completion:

job.waitForCompletion(true);

Counter someCount = job.getCounters().findCounter(CountersClass.N_COUNTERS.SOMECOUNT);

//put counter value into conf object of the job where you need to access it
//you can choose any name for the conf key really (i just used counter enum name here)
job2.getConfiguration().setLong(CountersClass.N_COUNTERS.SOMECOUNT.name(), someCount.getValue());

The next piece is to access the value in the other job's mapper/reducer. Just override setup(), for example:

private long someCount;

@Override
protected void setup(Context context) throws IOException,
    InterruptedException {
  super.setup(context);
  this.someCount = context.getConfiguration().getLong(CountersClass.N_COUNTERS.SOMECOUNT.name(), 0);
}
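Put together for the three-job chain in the question, the driver might look roughly like this. This is only a sketch: the job names, paths, and the elided setup calls are placeholders; the point is the counter hand-off from job 1 into the configurations of the later jobs before they are submitted.

```java
// Hypothetical driver chaining three jobs and forwarding a counter value.
public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job1 = Job.getInstance(conf, "wordCount");
        // ... job1 setup as in startFirstJob() ...
        job1.waitForCompletion(true);

        // Read the counter from the *finished* first job.
        long someCount = job1.getCounters()
                .findCounter(CountersClass.N_COUNTERS.SOMECOUNT).getValue();

        Job job2 = Job.getInstance(conf, "secondJob");
        // ... job2 setup ...
        job2.waitForCompletion(true);

        Job job3 = Job.getInstance(conf, "thirdJob");
        // Pass the value forward; job3's reducer reads it in setup().
        job3.getConfiguration()
                .setLong(CountersClass.N_COUNTERS.SOMECOUNT.name(), someCount);
        // ... job3 setup ...
        job3.waitForCompletion(true);
    }
}
```

This works because all three jobs are submitted from the same driver JVM, so the value read from job 1's counters is still in scope when job 3 is configured.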
yurgis
  • Thanks! What if I have more than one counter inside this `enum`? Can I just replace `setLong` and `getLong` with `setEnum` and `getEnum`? Or will I need to do what you said for all counters? – A. Sarid Jul 14 '16 at 05:52
  • 1
    Each enum item should correspond to a separate config key. You still use setLong getLong to access them by their respective keys – yurgis Jul 14 '16 at 05:57
  • I know this is an old question. But let's assume that the jobs start after some delay; won't the delayed job overwrite the counter set by the earlier-started job when run on a cluster? – user238607 Feb 15 '18 at 15:11
  • the answer above assumes 2 jobs executed from a driver on a same jvm instance. if you are talking about accessing a counter from a previous job you better store its results somewhere to access it later. – yurgis Feb 15 '18 at 17:13

Get the counters at the end of your 1st job, write their values to a file, and read that file in your subsequent job. Write it to HDFS if you want to read it from a reducer, or to a local file if you will read and initialize it in the application (driver) code.

Counters counters = job.getCounters();
Counter c1 = counters.findCounter(COUNTER_NAME);
System.out.println(c1.getDisplayName() + ":" + c1.getValue());

Reading and writing files is part of basic tutorials.
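As a sketch of this approach, the snippet below writes the counter value to an HDFS file after job 1 and reads it back in the third job's reducer setup(). The path `/tmp/wordcount/somecount` and the field name `someCount` are hypothetical; any agreed-upon location works.

```java
// After job 1 completes, in the driver: persist the counter value to HDFS.
long someCount = job.getCounters()
        .findCounter(CountersClass.N_COUNTERS.SOMECOUNT).getValue();
Path counterFile = new Path("/tmp/wordcount/somecount"); // hypothetical path
FileSystem fs = FileSystem.get(conf);
try (FSDataOutputStream out = fs.create(counterFile, true)) {
    out.writeLong(someCount);
}

// In the third job's reducer: read it back before reduce() is called.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (FSDataInputStream in = fs.open(new Path("/tmp/wordcount/somecount"))) {
        this.someCount = in.readLong();
    }
}
```

Unlike the configuration-based approach, this also works when the jobs are not launched from the same driver JVM, since the value survives in HDFS between runs.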

Radim