I'm working on a Hadoop project. My reduce phase is very memory-expensive. I'm currently using a HashMap, but I get `Error: Java heap space` because in reduce I build a huge in-memory map (the dataset is 32 GB). A possible solution is an in-memory map with a disk fallback, and MapDB seems to fit my needs.
But I'm not sure about the usage. The diskMap is unique to each reduce task, while the inMemory map is unique to each reduce key. Even though I set expireMaxSize(3) for testing, I'm not sure when the onDisk map is actually used and whether the logic is correct. Again, for testing, I fill the map with 20 fake entries.
Basically, to avoid heap overflow, I need to control the growth of the inMemory map.
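To make clear what behavior I expect from the overflow, here is a minimal plain-JDK sketch (no MapDB) of a size-bounded map that spills evicted entries into a fallback map, using `LinkedHashMap.removeEldestEntry`. The `spilled` map is just a stand-in for the on-disk store, and `BoundedSpillMap` is a hypothetical name:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a size-bounded in-memory map that spills evicted entries
// into a fallback map (a stand-in for an on-disk store).
public class BoundedSpillMap {
    static final int MAX_IN_MEMORY = 3;

    // Stand-in for the onDisk map.
    static final Map<Long, Integer> spilled = new HashMap<>();

    // Insertion-ordered map that evicts its eldest entry once the
    // size limit is exceeded, moving it into `spilled` first.
    static final Map<Long, Integer> inMemory =
            new LinkedHashMap<Long, Integer>(16, 0.75f, false) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, Integer> eldest) {
                    if (size() > MAX_IN_MEMORY) {
                        spilled.put(eldest.getKey(), eldest.getValue());
                        return true; // evict from the in-memory map
                    }
                    return false;
                }
            };

    public static void main(String[] args) {
        for (int k = 0; k < 20; k++) {
            inMemory.put((long) k, k * 2);
        }
        System.out.println("in memory: " + inMemory.size()); // 3
        System.out.println("spilled:   " + spilled.size());  // 17
    }
}
```

With 20 puts and a limit of 3, only the 3 newest entries stay in memory and the other 17 end up in the fallback map. This is the behavior I am hoping `expireMaxSize` + `expireOverflow` gives me in MapDB.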
public class TestReducer extends Reducer<LongWritable, BytesWritable, String, IntWritable> {

    private int id;
    private DB dbDisk;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        id = context.getTaskAttemptID().getTaskID().getId();
        File diskmap = new File("tmp/diskmap" + id);
        diskmap.delete(); // start this task from a clean file
        dbDisk = DBMaker
                .fileDB("tmp/diskmap" + id)
                .make();
    }

    @Override
    protected void reduce(LongWritable key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        DB dbMemory = DBMaker
                .memoryDB()
                .make();
        HTreeMap<Long, Integer> onDisk = dbDisk
                .hashMap("onDisk")
                .keySerializer(Serializer.LONG)
                .valueSerializer(Serializer.INTEGER)
                .createOrOpen();
        // fast in-memory collection with limited size
        HTreeMap<Long, Integer> inMemory = dbMemory
                .hashMap("inMemory")
                .expireMaxSize(3)
                .keySerializer(Serializer.LONG)
                .valueSerializer(Serializer.INTEGER)
                // this registers overflow to `onDisk`
                .expireOverflow(onDisk)
                .createOrOpen();

        for (int k = 0; k < 20; k++) {
            inMemory.put((long) k, k * 2);
        }

        for (Map.Entry<Long, Integer> entry : inMemory.entrySet()) {
            System.out.println("Key is: " + entry.getKey()
                    + " & Value is: " + entry.getValue());
        }

        dbMemory.close(); // per-key in-memory DB must not outlive this call
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        dbDisk.close();
    }
}