I am using Hadoop version 2.6.4. I was writing a MapReduce job that takes three arguments: a keyword, the path to the input files, and the path to the output files. The ideal output would be the names of all files that contain the keyword. The simple logic is to go through each line of the text and match it against the keyword; if it matches, print the filename.
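For reference, this is roughly how I invoke the job (the jar name here is just illustrative):

hadoop jar distributedgrep.jar org.myorg.DistributedGrep <keyword> <input path> <output path>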
After extensive googling I found three options for obtaining the filename:
context.getConfiguration().get("map.input.file")
context.getConfiguration().get("mapreduce.map.input.file")
Both methods returned a string with the value 'null', i.e. they printed 'null' on my terminal screen.
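For reference, this is roughly how I called them inside my mapper (a minimal sketch; the surrounding class is the same as in the full code at the end of this post):

// Inside GrepMapper.map(): both lookups come back null for me
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    System.out.println(context.getConfiguration().get("map.input.file"));           // prints "null"
    System.out.println(context.getConfiguration().get("mapreduce.map.input.file")); // prints "null"
}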
Finally, I tried this snippet from site.google.com:
public Path filesplit;
filesplit = ((FileSplit) context.getInputSplit()).getPath();
System.out.println(filesplit.getName());
The above method produced an error. The terminal output looked like this:
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
    at org.myorg.DistributedGrep$GrepMapper.map(DistributedGrep.java:23)
    at org.myorg.DistributedGrep$GrepMapper.map(DistributedGrep.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Can anybody suggest a remedy for these errors? What might have gone wrong, and are there any mistakes on my part?
Or do you have any other ideas for obtaining the filename of the file whose line is currently being processed by the mapper?
If you want to take a full look at the code, here it is:
package org.myorg;

import java.io.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class DistributedGrep {

    public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {

        public Path filesplit;

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            filesplit = ((FileSplit) context.getInputSplit()).getPath();
            String txt = value.toString();
            String mapRegex = context.getConfiguration().get("mapregex");
            // System.out.println(context.getConfiguration().get("mapreduce.map.input.file"));
            System.out.println(filesplit.getName());
            if (txt.matches(mapRegex)) {
                System.out.println("Matched a line");
                context.write(NullWritable.get(), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapregex", args[0]);

        Job job = new Job(conf, "Distributed Grep");
        job.setJarByClass(DistributedGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // Set number of reducers to zero

        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}