
I'm new to big data and Hadoop. I'm trying to find the median with MapReduce. As far as I know, the mappers pass their data to a single reducer, and that reducer sorts the values and finds the middle one with R's `median()` function.

R runs in memory, so what happens if the data is too big to fit in one reducer, which runs on a single machine?

Here is an example of my code to find the median with RHadoop:

map <- function(k, v) {
    # emit every value under one key so a single reducer sees all of them
    key <- "median"
    keyval(key, v)
}
reduce <- function(k, v) {
    # the lone reducer computes the median over the whole data set
    keyval(k, median(v))
}

medianMR <- mapreduce(
    input = random, output = "/tmp/ex3",
    map = map, reduce = reduce
)

1 Answer


It depends on the situation. If you set the number of reducers to 0 (with `job.setNumReduceTasks(0)`), no reducer runs and no aggregation takes place: each map task processes its InputSplit and writes its output directly, and the reduce phase is skipped entirely.
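For reference, a map-only job looks roughly like this with the Java API (class names and paths here are illustrative placeholders, not from your code):

```java
// Sketch of a map-only Hadoop job: zero reducers, map output is the job output.
// MyDriver, MyMapper and the paths are placeholders.
Job job = Job.getInstance(new Configuration(), "map-only example");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);   // skip the shuffle and reduce phases entirely
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```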

In your case, it also depends on whether finding the median in a series calls for more than one reducer. Depending on the range and uniqueness of the values in your input set, you could introduce a combiner that outputs the frequency of each value, reducing the number of map outputs sent to your single reducer. Your reducer can then consume the sorted value/frequency pairs to identify the median.
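The combiner idea can be simulated outside Hadoop. The sketch below (function names are illustrative, not any Hadoop API) shows how per-map frequency counts shrink what the single reducer has to process:

```python
from collections import Counter

def map_phase(split):
    # each mapper emits (value, 1) pairs
    return [(v, 1) for v in split]

def combine_phase(pairs):
    # the combiner collapses duplicates per map task, so the single
    # reducer receives at most one (value, frequency) pair per distinct value
    counts = Counter()
    for v, c in pairs:
        counts[v] += c
    return list(counts.items())

def reduce_phase(pairs):
    # walk the sorted (value, frequency) pairs until the cumulative
    # count passes the middle position
    total = sum(c for _, c in pairs)
    middle = (total - 1) // 2   # lower median, to keep the sketch simple
    seen = 0
    for v, c in sorted(pairs):
        seen += c
        if seen > middle:
            return v

splits = [[5, 1, 5, 3], [2, 5, 1]]          # two map input splits
combined = [p for s in splits for p in combine_phase(map_phase(s))]
print(reduce_phase(combined))               # median of [1,1,2,3,5,5,5] -> 3
```

With many repeated values the reducer sees far fewer pairs than raw records, which is exactly what makes the single-reducer approach survivable.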

Another approach, if you think your data is too large for one reducer, is a custom partitioner. It distributes the keys into range buckets (0-1000 go to reducer 1, 1001-2000 to reducer 2, ..., up to reducer n). This warrants a secondary job to analyse the reducer outputs and perform the final median calculation: knowing, for example, the number of keys in each reducer, you can work out which reducer's output contains the median.
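A minimal simulation of that range-bucket idea (again just a sketch with made-up names, not the Hadoop Partitioner API): bucket counts from the first pass tell you which bucket holds the median, and only that bucket ever needs sorting.

```python
def partition(value, bucket_size=1000):
    # range partitioner: 0-999 -> bucket 0, 1000-1999 -> bucket 1, ...
    return value // bucket_size

def find_median(data, bucket_size=1000):
    # "shuffle" each value to its range bucket (one bucket per reducer)
    buckets = {}
    for v in data:
        buckets.setdefault(partition(v, bucket_size), []).append(v)

    total = len(data)
    middle = (total - 1) // 2   # lower median position, for simplicity
    seen = 0
    # secondary pass: the per-bucket counts locate the bucket containing
    # the global median; only that bucket is sorted
    for b in sorted(buckets):
        if seen + len(buckets[b]) > middle:
            return sorted(buckets[b])[middle - seen]
        seen += len(buckets[b])

print(find_median([1500, 30, 2700, 999, 1200]))  # sorted: 30,999,1200,1500,2700 -> 1200
```

Because the buckets are ordered ranges, concatenating the sorted reducer outputs would give a globally sorted sequence, which is why the counting trick works.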

You can take a look at this answer, which might be helpful: number of reducers for 1 task in MapReduce

linkonabe