
I'm working on a Hadoop project. My reduce phase is very memory-expensive. I'm currently using a HashMap, but I get "Error: Java heap space" because in reduce I build a huge hash map (the dataset is 32 GB). The solution could be an in-memory hash map with a disk fallback, and MapDB seems to fit my needs, but I'm not sure about the usage. The disk map is unique for each reduce task; the in-memory map is unique for each reduce key. Even though I set expireMaxSize(3) for testing, I'm not sure when the onDisk map is used and whether the logic is correct. Again, for testing I fill the hash map with 20 fake entries. Basically, in order to avoid heap overflow, I need to control the growth of the in-memory map.

import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class TestReducer extends Reducer<LongWritable, BytesWritable, String, IntWritable> {

private int id;
DB dbDisk;
protected void setup(Context context) throws IOException, InterruptedException {
    id = context.getTaskAttemptID().getTaskID().getId();
    File diskmap = new File("tmp/diskmap"+id);
    diskmap.delete();
    dbDisk = DBMaker
                .fileDB("tmp/diskmap"+id)
                .make();
}

@Override
protected void reduce(LongWritable key, Iterable<BytesWritable> values, Context context)
        throws IOException, InterruptedException {

    DB dbMemory = DBMaker
            .memoryDB()
            .make();

    HTreeMap<Long,Integer> onDisk = dbDisk
            .hashMap("onDisk")
            .keySerializer(Serializer.LONG)
            .valueSerializer(Serializer.INTEGER)
            .createOrOpen();
    // fast in-memory collection with limited size
    HTreeMap<Long,Integer> inMemory = dbMemory
            .hashMap("inMemory")
            .expireMaxSize(3)
            .keySerializer(Serializer.LONG)
            .valueSerializer(Serializer.INTEGER)
            //this registers overflow to `onDisk`
            .expireOverflow(onDisk)
            .createOrOpen();

    for(int k=0;k<20;k++){
        inMemory.put((long)k,k*2);
    }
    // iterate the in-memory map and print its entries
    for (Map.Entry<Long, Integer> entry : inMemory.entrySet()) {
        System.out.print("Key is: " + entry.getKey() + " & ");
        System.out.println("Value is: " + entry.getValue());
    }

}
protected void cleanup(Context context) throws IOException,InterruptedException {
    dbDisk.close();
}

}
  • I never used MapDB, but it would be easy to use SQLite instead; it's easy to use and should be the right solution for what you are trying to do. – vgunnu Aug 31 '16 at 07:19

1 Answer


MapDB can allocate memory either in direct memory or on the heap of your application.

In order to use direct memory, you need to replace

DB dbMemory = DBMaker
             .memoryDB()
             .make();

with

DB dbMemory = DBMaker
              .memoryDirectDB()
              .make();

There is a JVM option,

-XX:MaxDirectMemorySize

that you can set to cap the maximum amount of direct memory it will use.
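
For example (the sizes here are only illustrative, and job is assumed to be your org.apache.hadoop.mapreduce.Job instance in the driver), you could pass that flag to the reduce-task JVMs through the mapreduce.reduce.java.opts property:

// Illustrative values: give each reduce task a modest heap plus a cap on the
// off-heap (direct) memory that MapDB's memoryDirectDB() store may allocate.
Configuration conf = job.getConfiguration();
conf.set("mapreduce.reduce.java.opts", "-Xmx1024m -XX:MaxDirectMemorySize=4g");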

You will still need to size the allocation so that you have enough direct memory for your data, but your application's heap won't grow with this data, so the application itself will not throw an out-of-memory error or hit the max heap limit (unless something else in it misbehaves).
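
Put together with the code from the question (a minimal sketch, assuming MapDB 3.x and reusing the same onDisk map and serializers as above), the in-memory side of the reducer would become:

// Sketch: off-heap in-memory store, bounded by -XX:MaxDirectMemorySize,
// with evicted entries overflowing to the file-backed onDisk map.
DB dbMemory = DBMaker
        .memoryDirectDB()
        .make();

HTreeMap<Long, Integer> inMemory = dbMemory
        .hashMap("inMemory")
        .expireMaxSize(3)                    // test value from the question
        .keySerializer(Serializer.LONG)
        .valueSerializer(Serializer.INTEGER)
        .expireOverflow(onDisk)              // evicted entries go to disk
        .createOrOpen();

If you create dbMemory once (for example in setup()) rather than per key, remember to close it in cleanup() alongside dbDisk so its store is released when the task finishes.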
