
We have a job that runs on a single node and takes up to 40 minutes to complete. With M/R we hope to get that down to under 2 minutes, but we're not sure which parts of the process belong in map() and which in reduce().

Current Process:
For a list of keys, call a web service for each key and get an XML response; transform the XML into pipe-delimited format; write a single output file at the end...

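def keys = 100..9999
def output = new StringBuilder()
keys.each { key ->
   // one web service call per key, then transform the response
   def xmlResponse = callRemoteService(key)
   def transformed = convertToPipeDelimited(xmlResponse)
   output.append(transformed)
}
// assuming file is a java.io.File, write() expects a String
file.write(output.toString())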

Map/Reduce Model
Here's how I modeled it with map/reduce, just want to make sure I'm on the right path...

Mapper
The keys are read from keys.txt; I call the remote service for each key and emit the key/XML pair...

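import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class XMLMapper extends Mapper<Text, Text, Text, Text> {
        private Text xml = new Text();
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
           // Assumes callRemoteService takes the key as a String.
           String xmlResponse = callRemoteService(key.toString());
           xml.set(xmlResponse);
           context.write(key, xml);
        }
    }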

Reducer
For each key/xml pair, I transform the xml to pipe-delimited format, then write out the result...

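import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class XMLToPipeDelimitedReducer extends Reducer<Text,Text,Text,Text> {
        private Text result = new Text();
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // keys.txt has no duplicate keys, so only the first value is read.
            String xml = values.iterator().next().toString();
            String transformed = convertToPipeDelimited(xml);
            result.set(transformed);
            context.write(key, result);
        }
    }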

Questions

  • Is it good practice to call the web service in map() while doing the transform in reduce(); any benefits from doing both operations in map()?
  • I don't check for duplicates in reduce() because keys.txt contains no duplicate keys; is that safe?
  • How can I control the format of the output file? TextOutputFormat looks interesting; I want it to read like this...
100|foo bar|$456,098
101|bar foo|$20,980
raffian

1 Answer


You should do the transform map-side, for a couple of reasons:

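  • Converting the XML to pipe-delimited text before the shuffle reduces the amount of data you serialize and transmit to the reducer.
  • You will be running multiple map tasks but a single reduce task, so you want to transform map-side to take advantage of that parallelism.
  • Since all the work is map-side, you can use an identity reducer instead of writing your own: the provided IdentityReducer in the old org.apache.hadoop.mapred API, or the base Reducer class in the org.apache.hadoop.mapreduce API your code uses, whose default reduce() passes every value through unchanged.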
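As a rough illustration of the map-side approach, here is a plain-Java sketch with no Hadoop dependency, so it runs on its own. `callRemoteService` and `convertToPipeDelimited` are the names from the question, but their bodies here are hypothetical stubs, and the `<id>`, `<name>` and `<amount>` tags are invented for the example since the real XML schema is not shown. It walks through the per-key work a mapper would do and compares the size of the raw XML with the size of the line that would actually be emitted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapSideTransform {
    // Hypothetical stub: stands in for the real web service call.
    static String callRemoteService(String key) {
        return "<record><id>" + key + "</id><name>foo bar</name>"
             + "<amount>$456,098</amount></record>";
    }

    // Hypothetical transform: extracts the text of three invented tags.
    static String convertToPipeDelimited(String xml) {
        return tag(xml, "id") + "|" + tag(xml, "name") + "|" + tag(xml, "amount");
    }

    // Returns the text between <name> and </name>, or "" if absent.
    private static String tag(String xml, String name) {
        Matcher m = Pattern.compile("<" + name + ">(.*?)</" + name + ">").matcher(xml);
        return m.find() ? m.group(1) : "";
    }

    // What map() would do for one key: fetch, transform, emit the short line.
    static String mapOneKey(String key) {
        return convertToPipeDelimited(callRemoteService(key));
    }

    public static void main(String[] args) {
        String xml = callRemoteService("100");
        String line = mapOneKey("100");
        System.out.println(line);
        System.out.println(xml.length() + " chars of XML, "
                + line.length() + " chars emitted");
    }
}
```

In a real job the body of `mapOneKey` would sit inside `map()`, with the result passed to `context.write(...)`.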

If you want a single output file, you'll want to use a single reducer; map-reduce produces one output file per reducer.

If you're sure there are no duplicate keys, then yes, it should be safe to ignore duplicates reduce-side.

I believe TextOutputFormat will by default write your (key, value) pairs to file as a tab-separated string, so not quite the format you want. See here for how you might change that.
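To make the record layout concrete, here is a plain-Java imitation of how each line is assembled: key, separator, value. It is not Hadoop's writer class, just a sketch of the layout, and `formatRecord` is a made-up name. As far as I know, the separator is read from the job configuration (`mapreduce.output.textoutputformat.separator` in Hadoop 2.x, `mapred.textoutputformat.separator` in 1.x), and a record whose key is NullWritable is written as the value alone; check both against the version you run.

```java
class TextOutputLine {
    // Imitates the record layout: key, separator, value.
    // A null key here stands in for a NullWritable key in Hadoop.
    static String formatRecord(String key, String value, String separator) {
        if (key == null) {
            return value;  // no key: only the value is written
        }
        return key + separator + value;
    }

    public static void main(String[] args) {
        // Default layout: a tab between key and value.
        System.out.println(formatRecord("100", "foo bar|$456,098", "\t"));
        // Separator changed to a pipe.
        System.out.println(formatRecord("100", "foo bar|$456,098", "|"));
        // Whole line carried in the value, no key.
        System.out.println(formatRecord(null, "100|foo bar|$456,098", "|"));
    }
}
```

Either of the last two variants yields the `100|foo bar|$456,098` layout from the question: change the separator, or put the complete line in the value and emit no key.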

Your web service is going to be one limiting factor here. Assuming you want your 40-minute job to run in 2 minutes, you'll probably want 40 or so map tasks calling it. Can it handle 40 concurrent callers?
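The arithmetic behind that estimate can be sketched as follows. The figures are the ones from the question (keys 100..9999, up to 40 minutes today, a 2-minute target); the assumptions that the service scales linearly and that per-call latency stays the same under load are mine, and `minParallelCallers` is a made-up helper name.

```java
class MapperCountEstimate {
    // Minimum number of concurrent callers, assuming perfect linear
    // scaling and unchanged per-call latency (both are assumptions).
    static int minParallelCallers(double serialSeconds, double targetSeconds) {
        return (int) Math.ceil(serialSeconds / targetSeconds);
    }

    public static void main(String[] args) {
        int keys = 9999 - 100 + 1;               // 100..9999 from the question
        double serialSeconds = 40 * 60;          // up to 40 minutes today
        double targetSeconds = 2 * 60;           // goal: under 2 minutes
        double perCallSeconds = serialSeconds / keys;

        System.out.println("keys: " + keys);
        System.out.println("average seconds per key: " + perCallSeconds);
        System.out.println("minimum concurrent callers: "
                + minParallelCallers(serialSeconds, targetSeconds));
    }
}
```

Under those assumptions the floor is 20 concurrent callers, so a figure of around 40 presumably leaves headroom for task startup and uneven call latency.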

Your other limiting factor is going to be the reduce-side. Assuming you want a single output file sorted by key, you're going to have to use a single reducer, and it will have to sort all your input, which could take a little bit.
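If the sort order ever matters, note that Text keys are compared as bytes, not as numbers, so the single reducer would not see 100, 101, 102 in that order. The plain-Java sketch below imitates this with String ordering, which matches byte-wise comparison for ASCII digits; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class KeySortOrder {
    // Order seen with text keys: for ASCII digits, byte-wise comparison
    // matches String's natural ordering.
    static List<String> sortedAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Order seen if the keys were compared as integers instead.
    static List<String> sortedAsNumbers(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("101", "9999", "1000", "100");
        System.out.println(sortedAsText(keys));     // [100, 1000, 101, 9999]
        System.out.println(sortedAsNumbers(keys));  // [100, 101, 1000, 9999]
    }
}
```

A numeric key type such as IntWritable would give the numeric order, at the cost of parsing the key in the mapper.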

Once you have your code working, you'll have to run some experiments and see what settings give you a reasonable run-time. Good luck.

DPM
  • The output doesn't have to be sorted. So `map()` is executed in parallel, but `reduce()`, in this case, is not because I need a single output file? How do I configure M/R to use a single reducer? – raffian Jul 17 '13 at 15:42
  • Output that passes through a reducer is always sorted. If you need a single output file, you'll have to set a reducer class (even just IdentityReducer), and it defaults to a single reducer instance. That number can be overridden in your driver class on the job configuration. – DPM Jul 17 '13 at 16:09