We have a job that runs on a single node and takes up to 40 minutes to complete. With M/R we hope to get that down to under 2 minutes, but we're not sure which parts of the process belong in map() and which in reduce().
Current Process:
For each key in a list, call a web service and get an XML response, transform the XML into pipe-delimited format, and write everything to a single file at the end...
def keys = 100..9999
def output = new StringBuffer()
keys.each { key ->
    def xmlResponse = callRemoteService(key)               // one remote call per key
    def transformed = convertToPipeDelimited(xmlResponse)  // XML -> pipe-delimited text
    output.append(transformed)
}
file.write(output.toString())  // file is a java.io.File defined elsewhere
Map/Reduce Model
Here's how I modeled it with map/reduce, just want to make sure I'm on the right path...
Mapper
The keys get pulled from keys.txt; I call the remote service for each key and store key/xml pair...
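// Needs java.io.IOException imported; assumes callRemoteService takes a String.
public static class XMLMapper extends Mapper<Text, Text, Text, Text> {
    private final Text xml = new Text();

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // One remote call per input key; emit the raw XML under that key.
        String xmlResponse = callRemoteService(key.toString());
        xml.set(xmlResponse);
        context.write(key, xml);
    }
}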
Reducer
For each key/xml pair, I transform the xml to pipe-delimited format, then write out the result...
public static class XMLToPipeDelimitedReducer extends Reducer<Text, Text, Text, Text> {
    private final Text result = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Only the first value is used, which assumes one XML response per key.
        String xml = values.iterator().next().toString();
        String transformed = convertToPipeDelimited(xml);
        result.set(transformed);
        context.write(key, result);
    }
}
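To sanity-check the model without a cluster, here is a minimal plain-Java sketch of the same flow with no Hadoop dependencies. Everything in it is an assumption for illustration: callRemoteService and convertToPipeDelimited are stubs (the real ones call the web service and parse the XML properly), the XML shape is made up, and the TreeMap grouping only stands in for Hadoop's shuffle. Unlike my reducer above, the reduce step here walks every value for a key, so a duplicate key would not be silently dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalMapReduceSketch {

    // Hypothetical stub: the real version calls the remote web service.
    static String callRemoteService(String key) {
        return "<item><id>" + key + "</id><name>foo bar</name></item>";
    }

    // Hypothetical stub: the real version would use a proper XML parser.
    static String convertToPipeDelimited(String xml) {
        String id = between(xml, "<id>", "</id>");
        String name = between(xml, "<name>", "</name>");
        return id + "|" + name;
    }

    // Returns the text between the first 'open' tag and the following 'close' tag.
    static String between(String s, String open, String close) {
        int start = s.indexOf(open) + open.length();
        return s.substring(start, s.indexOf(close, start));
    }

    // Map phase plus a stand-in for the shuffle: emit (key, xml), grouped and sorted by key.
    static Map<String, List<String>> mapAndShuffle(List<String> keys) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(callRemoteService(key));
        }
        return grouped;
    }

    // Reduce phase: transform every value for each key, not only the first one.
    static List<String> reduce(Map<String, List<String>> grouped) {
        List<String> lines = new ArrayList<>();
        for (List<String> xmls : grouped.values()) {
            for (String xml : xmls) {
                lines.add(convertToPipeDelimited(xml));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("100", "101", "102");
        reduce(mapAndShuffle(keys)).forEach(System.out::println);
    }
}
```

Since the transform only ever looks at one XML response at a time, the same sketch would work with convertToPipeDelimited called directly inside the map loop, which is the "both operations in map()" variant I ask about below.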
Questions
- Is it good practice to call the web service in map() while doing the transform in reduce(); any benefits from doing both operations in map()?
- I don't check for duplicates in reduce() because keys.txt contains no duplicate keys; is that safe?
- How can I control the format of the output file? TextOutputFormat looks interesting; I want it to read like this...
100|foo bar|$456,098
101|bar foo|$20,980
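My current understanding, which I'd like confirmed, is that TextOutputFormat writes each record as key, separator, value, newline, with a tab as the default separator. If that's right, the target layout could come from emitting the remaining fields as the value and setting the separator to "|" (I believe the property is mapreduce.output.textoutputformat.separator on newer Hadoop releases and mapred.textoutputformat.separator on older ones, but I have not verified this against our version). The plain-Java sketch below only models that record layout; formatRecord is a hypothetical helper, not a Hadoop API.

```java
class OutputLineSketch {

    // Models the assumed record layout: key, separator, value, newline.
    static String formatRecord(String key, String value, String separator) {
        return key + separator + value + "\n";
    }

    public static void main(String[] args) {
        // The value carries the remaining pipe-delimited fields.
        System.out.print(formatRecord("100", "foo bar|$456,098", "|"));
        System.out.print(formatRecord("101", "bar foo|$20,980", "|"));
    }
}
```

If convertToPipeDelimited already puts the key at the start of its output, the key would appear twice with this layout, so the value would need to hold only the trailing fields.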