Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, applicable to certain kinds of distributable problems

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
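In code, the model amounts to exactly those two user-supplied functions. A minimal, framework-agnostic sketch (the names are illustrative, not any particular library's API):

```python
# Word count, the canonical example: map emits intermediate key/value
# pairs, reduce merges all values that share a key.

def map_fn(document: str):
    # Emit (word, 1) for every word in the input value.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key: str, values):
    # Merge all intermediate values for one key into a single result.
    yield (key, sum(values))
```

Everything else (splitting input, shuffling pairs to reducers, retrying failures) is the framework's job; the user writes only these two functions.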

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation that share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to far larger datasets than a single commodity server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
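The two steps above, plus the grouping-by-key that connects them, can be simulated in a few lines. This is a single-process conceptual sketch, not a distributed implementation:

```python
from collections import defaultdict

def mapreduce(inputs, map_fn, reduce_fn):
    # "Map" step: run the map function over every input split.
    intermediate = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            # Shuffle: gather all values sharing a key, so they are
            # presented to the same reduce call.
            intermediate[key].append(value)
    # "Reduce" step: merge the values collected for each key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return sum(values)

result = mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn)
# result["the"] == 2
```

In a real framework the `inputs` list would be file splits on a distributed filesystem, and the intermediate dictionary would be replaced by a network shuffle between worker nodes.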

12,151 questions

12 votes, 4 answers

Can the combiner and reducer be different?

In many MapReduce programs, I see a reducer being used as a combiner as well. I know this is because of the specific nature of those programs. But I am wondering if they can be different.
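They can differ. A common illustration (a hypothetical sketch, not tied to a particular framework) is computing a mean: averaging per-node averages gives the wrong answer, so the combiner must emit partial (sum, count) pairs and only the reducer divides:

```python
# Mean per key: the combiner cannot simply reuse the reducer, because
# a mean of partial means is not the overall mean.

def map_fn(key, value):
    # Emit each measurement as a partial (sum, count) pair.
    yield (key, (value, 1))

def combine_fn(key, pairs):
    # Runs node-locally on a subset of the pairs: merge partial
    # sums and counts, keeping the same output type as the mapper.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, (total, count))

def reduce_fn(key, pairs):
    # Runs once per key over all partials: finish the mean.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, total / count)
```

Note that in Hadoop the combiner's output type must match the mapper's output type (the combiner may run zero, one, or many times), which is why `combine_fn` emits (sum, count) pairs rather than a finished average.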
kee • 10,969
12 votes, 3 answers

Understanding LongWritable

I'm sorry if this is a foolish question, but I couldn't find an answer with a Google search. How can I understand the LongWritable type? What is it? Can anybody link to a schema or other helpful page?
Mijatovic • 229
12 votes, 2 answers

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce?

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce? Each mapper class works on a different set of inputs, but they would all emit key-value pairs consumed by the same reducer. Note that I'm not talking about…
tibbe • 8,809
12 votes, 1 answer

What is the best way to run Map/Reduce stuff on data from Mongo?

I have a large Mongo database (100GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics and was wondering what the best workflow is for getting this done. Ideally I…
nickponline • 25,354
12 votes, 9 answers

Hadoop WordCount example stuck at map 100% reduce 0%

[hadoop-1.0.2] → hadoop jar hadoop-examples-1.0.2.jar wordcount /user/abhinav/input /user/abhinav/output Warning: $HADOOP_HOME is deprecated. ****hdfs://localhost:54310/user/abhinav/input 12/04/15 15:52:31 INFO input.FileInputFormat: Total…
Abhinav Sharma • 1,145
11 votes, 3 answers

Re-use Amazon Elastic MapReduce instance

I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 mins to complete the task. Is it possible to re-use the same instance to run another task? Even though I have just used the instance for 3 mins Amazon will…
Maggie • 5,923
11 votes, 1 answer

Streaming or custom Jar in Hadoop

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know about the speed gains I would experience if I implement the same mapper and reducer in Java (or use Pig). In particular, I'm…
Ruggiero Spearman • 6,735
11 votes, 2 answers

Are there MapReduce implementations on GPUs (CUDA)?

So far, I'm aware of Mars, but what about alternatives?
Nikita Zhiltsov • 654
11 votes, 1 answer

MapReduce shuffle/sort method

Somewhat of an odd question, but does anyone know what kind of sort MapReduce uses in the sort portion of shuffle/sort? I would think merge or insertion (in keeping with the whole MapReduce paradigm), but I'm not sure.
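Hadoop's shuffle is usually described as an external merge sort: each mapper ships its output already sorted by key (sorted in memory, spilled to sorted runs on disk), and the reduce side only has to merge sorted runs, never re-sort from scratch. A conceptual sketch of that merge phase, using Python's k-way merge:

```python
import heapq

# Two sorted runs, as two mappers might produce them.
run1 = [("apple", 1), ("fox", 1), ("zoo", 1)]
run2 = [("apple", 1), ("dog", 1), ("fox", 1)]

# k-way merge of already-sorted runs: equal keys end up adjacent,
# ready to be grouped and handed to reduce calls.
merged = list(heapq.merge(run1, run2))
```

The merge never buffers a whole run in memory at once, which is what makes the approach work for datasets far larger than RAM.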
SubSevn • 1,008
11 votes, 2 answers

Implementing PageRank using MapReduce

I'm trying to get my head around an issue with the theory of implementing PageRank with MapReduce. I have the following simple scenario with three nodes: A, B, C. The adjacency matrix is here: A { B, C } B { A } The PageRank for B, for example, is…
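One PageRank iteration maps naturally onto map/reduce: map emits each node's rank split among its out-links, reduce sums the incoming contributions under the damping factor. A single-process sketch on a hypothetical three-node graph (the excerpt's adjacency list is truncated, so C → {A} is assumed here for illustration):

```python
from collections import defaultdict

DAMPING = 0.85
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}  # C -> {A} is assumed
ranks = {node: 1.0 / len(graph) for node in graph}

def map_fn(node, links):
    # Emit this node's current rank, split evenly among its out-links.
    for target in links:
        yield (target, ranks[node] / len(links))

def reduce_fn(node, contributions):
    # Combine incoming contributions under the damping factor.
    return (1 - DAMPING) / len(graph) + DAMPING * sum(contributions)

# Simulate the shuffle: group contributions by target node.
incoming = defaultdict(list)
for node, links in graph.items():
    for target, share in map_fn(node, links):
        incoming[target].append(share)
new_ranks = {node: reduce_fn(node, incoming[node]) for node in graph}
```

In a real job this map/shuffle/reduce round is repeated until the ranks converge, with the graph structure passed along through each iteration.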
Nick D. • 111
11 votes, 3 answers

Should I learn/use MapReduce, or some other type of parallelization for this task?

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset. This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I…
11 votes, 3 answers

Converting IntWritable to int

I have the following code and I don't understand why the get() method has been used in the highlighted line. If I remove that get() method it throws an error. What I can take from it is: the get() method returns the int value of the IntWritable.…
Sri • 177
11 votes, 3 answers

Change output filename prefix for DataFrame.write()

Output files generated via the Spark SQL DataFrame.write() method begin with the "part" basename prefix. e.g. DataFrame sample_07 = hiveContext.table("sample_07"); sample_07.write().parquet("sample_07_parquet"); Results in: hdfs dfs -ls…
Rob • 113
11 votes, 1 answer

How are containers created based on vcores and memory in MapReduce2?

I have a tiny cluster composed of 1 master (namenode, secondarynamenode, resourcemanager) and 2 slaves (datanode, nodemanager). I have set in the yarn-site.xml of the master : yarn.scheduler.minimum-allocation-mb :…
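For context, container sizing is governed by a handful of yarn-site.xml properties: requests are rounded up to a multiple of the scheduler's minimum allocation, and with the default resource calculator only memory is considered (vcores are effectively ignored unless the dominant-resource calculator is configured). A hypothetical fragment with illustrative values (these are not the settings elided in the question):

```xml
<!-- Illustrative values only, for a small 2-slave cluster. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total memory one NodeManager may hand out -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- requests are rounded up to a multiple of this -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```

With these illustrative values, a map task requesting 1500 MB would be granted a 2048 MB container (rounded up to the next 1024 MB multiple).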
Nicomak • 2,319
11 votes, 3 answers

Could not find or load main class com.sun.tools.javac.Main hadoop mapreduce

I am trying to learn MapReduce but I am a little lost right now. http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Usage Particularly this set of instructions: Compile WordCount.java and…
Liondancer • 15,721