Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, applicable to certain kinds of distributable problems

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
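In code, the model amounts to exactly those two user-supplied functions. A minimal, framework-agnostic sketch (the names are illustrative, not any particular library's API):

```python
# Word count, the canonical example: map emits intermediate key/value
# pairs, reduce merges all values that share a key.

def map_fn(document: str):
    # Emit (word, 1) for every word in the input value.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key: str, values):
    # Merge all intermediate values for one key into a single result.
    yield (key, sum(values))
```

Everything else (splitting input, shuffling pairs to reducers, retrying failures) is the framework's job; the user writes only these two functions.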

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation that share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to far larger datasets than a single commodity server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
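The two steps above, plus the grouping-by-key that connects them, can be simulated in a few lines. This is a single-process conceptual sketch, not a distributed implementation:

```python
from collections import defaultdict

def mapreduce(inputs, map_fn, reduce_fn):
    # "Map" step: run the map function over every input split.
    intermediate = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            # Shuffle: gather all values sharing a key, so they are
            # presented to the same reduce call.
            intermediate[key].append(value)
    # "Reduce" step: merge the values collected for each key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return sum(values)

result = mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn)
# result["the"] == 2
```

In a real framework the `inputs` list would be file splits on a distributed filesystem, and the intermediate dictionary would be replaced by a network shuffle between worker nodes.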

12,151 questions

12 votes, 4 answers

Can the combiner and reducer be different?

In many MapReduce programs, I see a reducer being used as a combiner as well. I know this is because of the specific nature of those programs. But I am wondering if they can be different.
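They can differ. A common illustration (a hypothetical sketch, not tied to a particular framework) is computing a mean: averaging per-node averages gives the wrong answer, so the combiner must emit partial (sum, count) pairs and only the reducer divides:

```python
# Mean per key: the combiner cannot simply reuse the reducer, because
# a mean of partial means is not the overall mean.

def map_fn(key, value):
    # Emit each measurement as a partial (sum, count) pair.
    yield (key, (value, 1))

def combine_fn(key, pairs):
    # Runs node-locally on a subset of the pairs: merge partial
    # sums and counts, keeping the same output type as the mapper.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, (total, count))

def reduce_fn(key, pairs):
    # Runs once per key over all partials: finish the mean.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, total / count)
```

Note that in Hadoop the combiner's output type must match the mapper's output type (the combiner may run zero, one, or many times), which is why `combine_fn` emits (sum, count) pairs rather than a finished average.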
kee • 10,969
12 votes, 3 answers

Understanding LongWritable

I'm sorry if this is a foolish question, but I couldn't find an answer with a Google search. How can I understand the LongWritable type? What is it? Can anybody link to a schema or other helpful page?
Mijatovic • 229
12 votes, 2 answers

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce?

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce? Each mapper class works on a different set of inputs, but they would all emit key-value pairs consumed by the same reducer. Note that I'm not talking about…
tibbe • 8,809
12 votes, 1 answer

What is the best way to run Map/Reduce stuff on data from Mongo?

I have a large Mongo database (100GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics and was wondering what the best workflow is for getting this done. Ideally I…
nickponline • 25,354
12 votes, 9 answers

Hadoop WordCount example stuck at map 100% reduce 0%

[hadoop-1.0.2] → hadoop jar hadoop-examples-1.0.2.jar wordcount /user/abhinav/input /user/abhinav/output Warning: $HADOOP_HOME is deprecated. ****hdfs://localhost:54310/user/abhinav/input 12/04/15 15:52:31 INFO input.FileInputFormat: Total…
Abhinav Sharma • 1,145
11 votes, 3 answers

Re-use Amazon Elastic MapReduce instance

I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 mins to complete the task. Is it possible to re-use the same instance to run another task? Even though I have just used the instance for 3 mins Amazon will…
Maggie • 5,923
11 votes, 1 answer

Streaming or custom Jar in Hadoop

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know about the speed gains I would experience if I implement the same mapper and reducer in Java (or use Pig). In particular, I'm…
Ruggiero Spearman • 6,735
11 votes, 2 answers

Are there MapReduce implementations on GPUs (CUDA)?

So far, I'm aware of Mars, but what about alternatives?
Nikita Zhiltsov • 654
11 votes, 1 answer

MapReduce shuffle/sort method

Somewhat of an odd question, but does anyone know what kind of sort MapReduce uses in the sort portion of shuffle/sort? I would think merge or insertion (in keeping with the whole MapReduce paradigm), but I'm not sure.
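Hadoop's shuffle is usually described as an external merge sort: each mapper ships its output already sorted by key (sorted in memory, spilled to sorted runs on disk), and the reduce side only has to merge sorted runs, never re-sort from scratch. A conceptual sketch of that merge phase, using Python's k-way merge:

```python
import heapq

# Two sorted runs, as two mappers might produce them.
run1 = [("apple", 1), ("fox", 1), ("zoo", 1)]
run2 = [("apple", 1), ("dog", 1), ("fox", 1)]

# k-way merge of already-sorted runs: equal keys end up adjacent,
# ready to be grouped and handed to reduce calls.
merged = list(heapq.merge(run1, run2))
```

The merge never buffers a whole run in memory at once, which is what makes the approach work for datasets far larger than RAM.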
SubSevn • 1,008
11 votes, 2 answers

Implementing PageRank using MapReduce

I'm trying to get my head around an issue with the theory of implementing PageRank with MapReduce. I have the following simple scenario with three nodes: A, B, C. The adjacency matrix is here: A { B, C } B { A } The PageRank for B, for example, is…
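One PageRank iteration maps naturally onto map/reduce: map emits each node's rank split among its out-links, reduce sums the incoming contributions under the damping factor. A single-process sketch on a hypothetical three-node graph (the excerpt's adjacency list is truncated, so C → {A} is assumed here for illustration):

```python
from collections import defaultdict

DAMPING = 0.85
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}  # C -> {A} is assumed
ranks = {node: 1.0 / len(graph) for node in graph}

def map_fn(node, links):
    # Emit this node's current rank, split evenly among its out-links.
    for target in links:
        yield (target, ranks[node] / len(links))

def reduce_fn(node, contributions):
    # Combine incoming contributions under the damping factor.
    return (1 - DAMPING) / len(graph) + DAMPING * sum(contributions)

# Simulate the shuffle: group contributions by target node.
incoming = defaultdict(list)
for node, links in graph.items():
    for target, share in map_fn(node, links):
        incoming[target].append(share)
new_ranks = {node: reduce_fn(node, incoming[node]) for node in graph}
```

In a real job this map/shuffle/reduce round is repeated until the ranks converge, with the graph structure passed along through each iteration.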
Nick D. • 111
11 votes, 3 answers

Should I learn/use MapReduce, or some other type of parallelization for this task?

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset. This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I…
11 votes, 3 answers

Converting IntWritable to int

I have the following code and I don't understand why the get() method has been used in the highlighted line. If I remove that get() method it throws an error. What I can take from it is: the get() method returns the int value of the IntWritable.…
Sri • 177
11 votes, 3 answers

Change output filename prefix for DataFrame.write()

Output files generated via the Spark SQL DataFrame.write() method begin with the "part" basename prefix. e.g. DataFrame sample_07 = hiveContext.table("sample_07"); sample_07.write().parquet("sample_07_parquet"); Results in: hdfs dfs -ls…
Rob • 113
11 votes, 1 answer

How are containers created based on vcores and memory in MapReduce2?

I have a tiny cluster composed of 1 master (namenode, secondarynamenode, resourcemanager) and 2 slaves (datanode, nodemanager). I have set in the yarn-site.xml of the master : yarn.scheduler.minimum-allocation-mb :…
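For context, container sizing is governed by a handful of yarn-site.xml properties: requests are rounded up to a multiple of the scheduler's minimum allocation, and with the default resource calculator only memory is considered (vcores are effectively ignored unless the dominant-resource calculator is configured). A hypothetical fragment with illustrative values (these are not the settings elided in the question):

```xml
<!-- Illustrative values only, for a small 2-slave cluster. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total memory one NodeManager may hand out -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- requests are rounded up to a multiple of this -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```

With these illustrative values, a map task requesting 1500 MB would be granted a 2048 MB container (rounded up to the next 1024 MB multiple).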
Nicomak • 2,319
11 votes, 3 answers

Could not find or load main class com.sun.tools.javac.Main hadoop mapreduce

I am trying to learn MapReduce but I am a little lost right now. http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Usage Particularly this set of instructions: Compile WordCount.java and…
Liondancer • 15,721