
I am trying to implement a MapReduce job that processes a large text file (as a lookup file) in addition to the actual dataset (input). The lookup file is more than 2 GB. I tried to load the text file as a third argument, but I got a Java heap space error.

After some searching, it was suggested to use the distributed cache. This is what I have done so far. First, I used this method to read the lookup file:

public static String readDistributedFile(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath());
    FileSystem fs = FileSystem.get(new Configuration());
    StringBuilder sb = new StringBuilder();
    // Read the cached lookup file line by line and build one big String.
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
        sb.append("\n");
    }
    br.close();
    return sb.toString();
}

Second, in the Mapper:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);

    String lookUpText = readDistributedFile(context);
    // do something with the text
}

Third, to run the job:

hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
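For -files to take effect, the driver has to parse Hadoop's generic options, which is usually done by running the job through ToolRunner (GenericOptionsParser). A minimal driver sketch along those lines (the class name and job name are placeholders, not from the actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Placeholder driver: ToolRunner applies GenericOptionsParser, which handles -files
// and registers the lookup file in the distributed cache before the job starts.
public class LookUpJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // After generic options are stripped, args[0] is the input glob and args[1] the output dir.
        Job job = Job.getInstance(getConf(), "lookup-job");
        job.setJarByClass(LookUpJobDriver.class);
        job.setMapperClass(Mapper.class); // replace with the mapper whose setup() reads the cached file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new LookUpJobDriver(), args));
    }
}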

But the problem is that the job is taking a long time to load. Maybe it was not a good idea to use the distributed cache, or maybe I am missing something in my code.

I am working with Hadoop 2.5. I have already checked some related questions such as [1].

Any ideas would be great!

[1] Hadoop DistributedCache is deprecated - what is the preferred API?


1 Answer


The distributed cache is mostly used to move files that are needed by the MapReduce tasks on the task nodes and are not part of the jar.

Another use is when performing a join between a big and a small data set: rather than using multiple input paths, we use a single input (the big file), fetch the small file from the distributed cache, and then compare (or join) the two data sets.
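As a rough illustration of that pattern (the class name and the tab-separated key/value layout below are assumptions, not taken from the question), the small cached file is loaded into an in-memory map once in setup() and each record of the big input is joined against it in map():

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join sketch: assumes the small cached file holds "id<TAB>value" lines
// and that the big input's first tab-separated field is the join key.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        Path path = new Path(cacheFiles[0].getPath());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2);
        String joined = lookup.get(fields[0]);
        if (fields.length == 2 && joined != null) {
            // Emit the joined record: key plus fields from both data sets.
            context.write(new Text(fields[0]), new Text(fields[1] + "\t" + joined));
        }
    }
}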

The reason for the extra time in your case is that you are trying to read the entire 2 GB file before the map phase starts (since it is read in the setup method).

Can you give the reason why you are loading the huge 2 GB file using the distributed cache?

– Ramzy
  • Thanks a lot for your reply. I just needed a way to load an extra input other than my dataset. This 2 GB file will be processed in a different way than the dataset. When I load it locally (as a third argument), the class throws a Java heap space error. After searching, some websites suggested this way. Do you know a better way? – Daisy Oct 20 '15 at 20:37
  • As I said above, MultipleInputs is another option. You will have two mappers, each handling a different format. But you also need some attribute that is also present in your actual input; eventually you are doing a join. Can you explain what you are doing with that file and how you are linking it with the actual input? – Ramzy Oct 20 '15 at 21:33
  • This large file is considered "extra read-only data needed by a job to process the main dataset". So according to the Hadoop: The Definitive Guide book, this can be done using either the "job configuration" or the "distributed cache". My problem is that the file is really huge. – Daisy Oct 21 '15 at 09:00
  • Yeah, I get that. This use case is mostly for small files, or zipped or jar files. If you are fine with the extra time, it should be OK. Otherwise, since your lookup is really a join, you can plan to use MultipleInputs with two mappers and then perform your logic in the reducer (see the sketch after these comments). As you know, the main criterion for this is that you should have a matching field in your main input file too. – Ramzy Oct 21 '15 at 14:48
  • Thanks a lot for your reply. Basically, the lookup text file contains IDs of records that I want to delete from the original dataset. Later on this lookup file might be around 13 GB, so I think the distributed cache won't be feasible. – Daisy Oct 22 '15 at 09:13
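To make the MultipleInputs suggestion from the comments concrete for this delete-by-ID use case, here is a minimal sketch (the class names, the one-ID-per-line lookup format, and the tab-separated record layout are all assumptions): both inputs are keyed by ID, the lookup mapper emits a delete tag, and the reducer keeps only records whose ID was never tagged.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DeleteByIdJob {

    // Lookup input: one ID per line (assumed); tag each ID with "D" for "delete".
    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString().trim()), new Text("D"));
        }
    }

    // Main dataset: assumes the record's ID is the first tab-separated field.
    public static class DataMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String id = value.toString().split("\t", 2)[0];
            context.write(new Text(id), value);
        }
    }

    // Anti-join: drop every record whose ID appeared in the lookup input.
    // Records per key are buffered, which assumes only a few records share one ID.
    public static class FilterReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            boolean delete = false;
            List<String> records = new ArrayList<>();
            for (Text v : values) {
                if ("D".equals(v.toString())) {
                    delete = true;
                } else {
                    records.add(v.toString());
                }
            }
            if (!delete) {
                for (String record : records) {
                    context.write(NullWritable.get(), new Text(record));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = main dataset, args[1] = lookup file, args[2] = output directory.
        Job job = Job.getInstance(new Configuration(), "delete-by-id");
        job.setJarByClass(DeleteByIdJob.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, DataMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, LookupMapper.class);
        job.setReducerClass(FilterReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the lookup file arrives as an ordinary MapReduce input rather than being held in memory, this approach scales to a 13 GB lookup file, at the cost of a shuffle over both data sets.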