I am trying to implement a MapReduce job that processes a large text file (used as a lookup file) in addition to the actual dataset (the input). The lookup file is more than 2 GB. I tried to load the text file by passing it to the job as a third argument, but I got a Java Heap Space error.
After doing some searching, using the Distributed Cache was suggested. This is what I have done so far. First, I used this method to read the lookup file:
public static String readDistributedFile(Context context) throws IOException {
    // The -files option places the lookup file in the distributed cache;
    // getCacheFiles() returns the URIs of the cached files.
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath());

    // Use the job configuration so the correct file system is resolved.
    FileSystem fs = FileSystem.get(context.getConfiguration());

    // Read the whole lookup file into a single string.
    StringBuilder sb = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        while ((line = br.readLine()) != null) {
            // split line
            sb.append(line);
            sb.append("\n");
        }
    }
    return sb.toString();
}
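(For what it's worth, my understanding is that getCacheFiles() only sees the file because the -files option in the command further down registers it in the distributed cache. If I were not using -files, I believe the driver would have to register it explicitly, roughly like this sketch, where the HDFS path is just a placeholder:)

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Driver-side sketch (hypothetical path and job name): register the
// lookup file in the distributed cache so getCacheFiles() can find it.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "lookup join");
job.addCacheFile(new URI("/user/name/LargeLookUpFile.txt"));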
Second, in the Mapper:
protected void setup(Context context)
        throws IOException, InterruptedException {
    super.setup(context);
    String lookUpText = readDistributedFile(context);
    // do something with the text
}
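(The "do something with the text" part is not shown here. Purely as an illustration, if the lookup file were tab-separated key/value pairs, the setup could split the text into a map once per mapper; the field layout and variable names below are made up for the example:)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustration only: assumes each lookup line looks like "key<TAB>value".
private Map<String, String> lookUpMap = new HashMap<String, String>();

protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    String lookUpText = readDistributedFile(context);
    for (String line : lookUpText.split("\n")) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
            lookUpMap.put(parts[0], parts[1]); // lookup key -> lookup value
        }
    }
}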
Third, to run the job:
hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
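This command assumes the driver goes through ToolRunner, so that GenericOptionsParser strips the -files option and places the file into the distributed cache before run() sees the remaining arguments. A minimal sketch of such a driver (the class name and job name are placeholders, not my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LookUpJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // args no longer contains -files here; GenericOptionsParser has
        // already consumed it and registered the file in the distributed cache.
        Job job = Job.getInstance(getConf(), "look up join");
        job.setJarByClass(LookUpJobDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // /user/name/inputdataset/*.gz
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // /user/name/output
        // mapper, reducer, and output key/value classes are set here as usual
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new LookUpJobDriver(), args));
    }
}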
But the problem is that the job is taking a long time to load. Maybe it was not a good idea to use the distributed cache, or maybe I am missing something in my code.
I am working with Hadoop 2.5. I have already checked some related questions such as [1].
Any ideas would be great!
[1] Hadoop DistributedCache is deprecated - what is the preferred API?