I am using MultipleOutputs in the reduce phase of my job. The data set I am working on is around 270 MB, and I am running the job on a pseudo-distributed single-node cluster. I use a custom Writable for my map output values; the keys are the countries present in the data set.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// "name" is my custom Writable (patent number plus inventor name).
public class reduce_class extends Reducer<Text, name, NullWritable, Text> {
    @Override
    public void reduce(Text key, Iterable<name> values, Context context)
            throws IOException, InterruptedException {
        // A new MultipleOutputs is created and closed on every reduce() call.
        MultipleOutputs<NullWritable, Text> m = new MultipleOutputs<NullWritable, Text>(context);
        NullWritable out = NullWritable.get();
        // Buffers every name for this country in memory, grouped by patent number.
        TreeMap<Long, ArrayList<String>> map = new TreeMap<Long, ArrayList<String>>();
        for (name nn : values) {
            long pat = nn.patent_No.get();
            if (!map.containsKey(pat)) {
                map.put(pat, new ArrayList<String>());
            }
            map.get(pat).add(nn.getName().toString());
        }
        // The country key is used as the base output path, so each country gets its own file.
        for (Map.Entry<Long, ArrayList<String>> entry : map.entrySet()) {
            m.write(out, new Text("--------------------------"), key.toString());
            m.write(out, new Text(entry.getKey().toString()), key.toString());
            for (String n : entry.getValue()) {
                m.write(out, new Text(n), key.toString());
            }
            m.write(out, new Text("--------------------------"), key.toString());
        }
        m.close();
    }
}
Above is my reduce logic.
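To make the grouping step easier to reason about outside Hadoop, here is a minimal plain-Java sketch of the same logic. The class `PatentGrouping` and its methods `group` and `render` are hypothetical names introduced only for illustration, and plain strings stand in for the custom Writable; it is not the job's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original job: mirrors the reducer's
// grouping step with plain Java types instead of Hadoop Writables.
class PatentGrouping {

    // Each record is {patentNumber, inventorName}. TreeMap keeps patents sorted.
    static TreeMap<Long, List<String>> group(List<String[]> records) {
        TreeMap<Long, List<String>> byPatent = new TreeMap<>();
        for (String[] rec : records) {
            long patent = Long.parseLong(rec[0]);
            // computeIfAbsent replaces the containsKey/put/get sequence.
            byPatent.computeIfAbsent(patent, k -> new ArrayList<>()).add(rec[1]);
        }
        return byPatent;
    }

    // Renders each group as: separator, patent number, names, separator.
    static List<String> render(TreeMap<Long, List<String>> byPatent) {
        String sep = "--------------------------";
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, List<String>> e : byPatent.entrySet()) {
            lines.add(sep);
            lines.add(e.getKey().toString());
            lines.addAll(e.getValue());
            lines.add(sep);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> us = new ArrayList<>();
        us.add(new String[] {"3858242", "Elwyn R. Gooding"});
        us.add(new String[] {"3858241", "Philip E. Durand"});
        us.add(new String[] {"3858241", "Lonnie H. Norris"});
        for (String line : render(group(us))) {
            System.out.println(line);
        }
    }
}
```

Note that, like the reducer, this sketch holds every name for one key in memory before anything is written, so its memory use grows with the number of values per country.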
Problems:
1) The code above works fine with a small data set but fails with a Java heap space error on the 270 MB data set.
2) Using the country as the key passes a very large number of values in a single Iterable. I tried to solve this, but MultipleOutputs creates unique files for a given set of keys. The point is that I am unable to append to an existing file created by a previous run of reduce; it throws an error, so for particular keys I have to create new files. Is there a way to work around this? Solving that error is what led me to define the keys as country names (my final sorted data), but that throws the Java heap space error.
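One possible direction, sketched here in plain Java under stated assumptions: if the values for a country arrived already sorted by patent number, each group could be written as soon as the patent number changes, so nothing would need to be buffered. In a Hadoop job that ordering is an assumption; it would typically require a secondary sort (a composite key with a grouping comparator), which is not part of the code above. The class `StreamingGroupWriter` and the `sink` parameter are hypothetical; in the real job the sink would correspond to the MultipleOutputs write call. The usage pattern in the Hadoop Javadoc creates the MultipleOutputs instance once in `setup()` and closes it in `cleanup()`, which may also be relevant to the error about existing files.

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical sketch, not the original reducer: writes groups as they stream by.
class StreamingGroupWriter {
    static final String SEP = "--------------------------";

    // records: {patentNumber, inventorName}, ASSUMED to be sorted by patent number.
    // sink: receives one output line at a time (stand-in for the real writer).
    static void write(Iterator<String[]> records, Consumer<String> sink) {
        String current = null;
        while (records.hasNext()) {
            String[] rec = records.next();
            if (!rec[0].equals(current)) {
                if (current != null) {
                    sink.accept(SEP); // close the previous patent's block
                }
                sink.accept(SEP);     // open a new block
                sink.accept(rec[0]);
                current = rec[0];
            }
            sink.accept(rec[1]);
        }
        if (current != null) {
            sink.accept(SEP);         // close the last block
        }
    }
}
```

Only the current patent number is kept in memory, so memory use no longer depends on how many values a country has. If the input is not sorted, the same patent number would appear in several separate blocks.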
Sample Input
3858241,"Durand","Philip","E.","","","Hudson","MA","US","",1
3858241,"Norris","Lonnie","H.","","","Milford","MA","US","",2
3858242,"Gooding","Elwyn","R.","","120 Darwin Rd.","Pinckney","MI","US","48169",1
3858243,"Pierron","Claude","Raymond","","","Epinal","","FR","",1
3858243,"Jenny","Jean","Paul","","","Decines","","FR","",2
3858243,"Zuccaro","Robert","","","","Epinal","","FR","",3
3858244,"Mann","Richard","L.","","P.O. Box 69","Woodstock","CT","US","06281",1
Sample output for small data sets
Sample directory structure:
CA-r-00000
FR-r-00000
Quebec-r-00000
TX-r-00000
US-r-00000
*Individual contents*
3858241
Philip E. Durand
Lonnie H. Norris
3858242
Elwyn R. Gooding
3858244