Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
18
votes
2 answers

When does shuffling occur in Apache Spark?

I am optimizing parameters in Spark, and would like to know exactly how Spark shuffles data. Specifically, I have a simple word count program, and would like to know how spark.shuffle.file.buffer.kb affects the run time. Right now, I only see…
cnnrznn
  • 435
  • 1
  • 4
  • 11
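
As background for where the shuffle actually happens in such a job, here is a minimal Spark word-count sketch in Java (Spark 2.x Java API assumed; paths are hypothetical). In releases of that era the setting is named spark.shuffle.file.buffer, with spark.shuffle.file.buffer.kb being the older name; it sizes the in-memory buffer used when writing shuffle files. The shuffle boundary is reduceByKey.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class ShuffleDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("shuffle-demo")
                    .setMaster("local[2]")                    // local mode for experimentation
                    .set("spark.shuffle.file.buffer", "64k"); // shuffle write buffer
                                                              // (older name: spark.shuffle.file.buffer.kb)

            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical input

                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum); // <-- shuffle boundary: map output is
                                                    //     partitioned by key and written to
                                                    //     shuffle files before reduction

                counts.saveAsTextFile("output");    // hypothetical output path
            }
        }
    }
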
18
votes
1 answer

YARN is not honouring yarn.nodemanager.resource.cpu-vcores

I am using Hadoop-2.4.0 and my system configs are 24 cores, 96 GB RAM. I am using the following…
banjara
  • 3,800
  • 3
  • 38
  • 61
18
votes
4 answers

Computing median in map reduce

Can someone explain the computation of the median/quantiles in MapReduce? My understanding of DataFu's median is that the n mappers sort the data and send it to a single reducer, which is responsible for sorting all the data from the n mappers and…
learner
  • 885
  • 3
  • 14
  • 28
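
For reference, the single-reducer strategy described in the question above can be sketched roughly as follows (a hypothetical reducer only; it assumes every mapper emits its numbers under one constant NullWritable key, and that the full dataset fits in the lone reducer's memory, which is exactly the scalability concern the question raises; DataFu itself ships this logic as Pig UDFs rather than raw MapReduce).

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // All mappers emit (NullWritable, value), so every value reaches this one
    // reducer. It buffers and sorts them, then picks the middle element.
    public class MedianReducer
            extends Reducer<NullWritable, DoubleWritable, NullWritable, DoubleWritable> {

        @Override
        protected void reduce(NullWritable key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            List<Double> buffer = new ArrayList<>();
            for (DoubleWritable v : values) {
                buffer.add(v.get()); // caution: the entire dataset must fit in memory
            }
            Collections.sort(buffer);

            int n = buffer.size();
            double median = (n % 2 == 1)
                    ? buffer.get(n / 2)
                    : (buffer.get(n / 2 - 1) + buffer.get(n / 2)) / 2.0;
            context.write(NullWritable.get(), new DoubleWritable(median));
        }
    }
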
17
votes
3 answers

Can someone explain map-reduce in C#?

Can anyone please explain the concept of map-reduce, particularly in Mongo? I also use C# so any specifics in that area would also be useful.
Rawhi
  • 6,155
  • 8
  • 36
  • 57
17
votes
3 answers

Secondary Sort in Hadoop

I am working on a Hadoop project and, after many visits to various blogs and reading the documentation, I realized I need to use the secondary sort feature provided by the Hadoop framework. My input format is of the form: DESC(String) Price(Integer) and some…
Abhishek Singh
  • 275
  • 1
  • 2
  • 18
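
The usual shape of Hadoop's secondary sort is a composite key sorted on (natural key, secondary field), plus a partitioner and grouping comparator that look only at the natural key. A minimal sketch of the composite key for the DESC/Price input described above (names are hypothetical; the partitioner and grouping comparator are omitted):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Composite key: group by the natural key (desc), order by price within a group.
    public class DescPriceKey implements WritableComparable<DescPriceKey> {
        private String desc;
        private int price;

        public DescPriceKey() { }                 // required no-arg constructor

        public DescPriceKey(String desc, int price) {
            this.desc = desc;
            this.price = price;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(desc);
            out.writeInt(price);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            desc = in.readUTF();
            price = in.readInt();
        }

        @Override
        public int compareTo(DescPriceKey other) {
            int cmp = desc.compareTo(other.desc);       // primary sort: natural key
            if (cmp != 0) {
                return cmp;
            }
            return Integer.compare(price, other.price); // secondary sort: price
        }

        // A real implementation should also override hashCode()/equals().
    }

With this in place, a Partitioner that hashes only desc and a grouping comparator that compares only desc ensure each reduce() call sees one DESC group with its prices already in order.
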
17
votes
2 answers

What's the successor of mrunit?

Today I found out that the ASF retired mrunit (see https://blogs.apache.org/foundation/entry/the_apache_news_round_up85 and https://issues.apache.org/jira/browse/HADOOP-3733 and the homepage itself). Other than "inactivity" there was no reason…
David Ongaro
  • 3,568
  • 1
  • 24
  • 36
17
votes
1 answer

Does Mongoid have Map/Reduce?

I am using Ruby code to calculate a sum from the array returned by Mongoid. But maybe using Map/Reduce would be faster, except I don't see any docs for Map/Reduce on mongoid.org, and googling for map reduce site:mongoid.org doesn't give any results either…
nonopolarity
  • 146,324
  • 131
  • 460
  • 740
17
votes
3 answers

Writable and WritableComparable in Hadoop?

Could anyone please explain: what are the Writable and WritableComparable interfaces in Hadoop? What is the difference between these two? Please explain with an example. Thanks in advance.
vipin chourasia
  • 211
  • 1
  • 3
  • 8
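
Briefly: Writable is Hadoop's serialization contract (write/readFields), while WritableComparable extends it with compareTo so the type can also serve as a key, which the framework must sort during the shuffle. A minimal sketch of each role (the types here are hypothetical):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;

    // Writable: enough to be used as a VALUE; Hadoop only needs to (de)serialize it.
    class PointWritable implements Writable {
        double x, y;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeDouble(x);
            out.writeDouble(y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            x = in.readDouble();
            y = in.readDouble();
        }
    }

    // WritableComparable: additionally comparable, so it can be used as a KEY,
    // which is sorted between the map and reduce phases.
    class UserIdKey implements WritableComparable<UserIdKey> {
        long id;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
        }

        @Override
        public int compareTo(UserIdKey other) {
            return Long.compare(id, other.id);
        }
    }
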
17
votes
9 answers

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions

While running my Java MapReduce application in Eclipse, I am facing the below exception. I have included the commons-logging-1.2.jar file in my build path as well, but the exception below still occurs. I am new to Hadoop. Kindly help me out. Exception in thread…
JGS
  • 369
  • 2
  • 5
  • 17
17
votes
2 answers

Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other. For example, one common piece of rule-of-thumb advice I see can be summarized as... Big data,…
GuSuku
  • 1,371
  • 1
  • 14
  • 30
17
votes
1 answer

Difference between HDFS and NFS?

I am a newbie on this. I would like to know the basic differences between the Hadoop Distributed File System and the Network File System, and what the benefits of HDFS over NFS are.
Alok Pathak
  • 875
  • 1
  • 8
  • 20
17
votes
3 answers

Is it possible to start an embedded instance of an Apache Spark node?

I want to start an instance of a standalone Apache Spark cluster embedded in my Java app. I tried to find some documentation on their website but have had no luck yet. Is this possible?
Rodrigo
  • 195
  • 1
  • 10
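
Spark's local mode effectively gives you an embedded single-JVM instance with no external daemons; a minimal sketch using the standard Java API (there is also a less documented local-cluster master URL used in Spark's own tests, which simulates a standalone cluster in-process):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class EmbeddedSpark {
        public static void main(String[] args) {
            // "local[*]" runs Spark inside this JVM, one worker thread per core --
            // no external cluster or daemons required.
            SparkConf conf = new SparkConf()
                    .setAppName("embedded-spark")
                    .setMaster("local[*]");

            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                long count = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
                System.out.println("count = " + count);
            }
        }
    }
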
17
votes
1 answer

Top N values by Hadoop Map Reduce code

I am very new to the Hadoop world and struggling to achieve one simple task. Can anybody please tell me how to get the top n values for the word count example using only MapReduce code? I do not want to use any Hadoop command for this simple…
user3078014
  • 171
  • 1
  • 1
  • 4
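
One code-only pattern for top-N: each mapper keeps a small sorted structure of its local top N and emits it in cleanup(), so only N candidates per mapper reach a single reducer. A sketch of the mapper side (hypothetical; input is assumed to be word<TAB>count pairs produced by a prior word-count job):

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Keeps only the N largest counts seen by this mapper; emits them at the end,
    // so at most N * numMappers candidate records reach the single reducer.
    public class TopNMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private static final int N = 10;
        private final TreeMap<Integer, String> topN = new TreeMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String[] fields = value.toString().split("\t"); // word \t count
            int count = Integer.parseInt(fields[1]);
            topN.put(count, fields[0]); // note: ties on count overwrite each other
                                        // in this simplified sketch
            if (topN.size() > N) {
                topN.remove(topN.firstKey()); // drop the current smallest
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Map.Entry<Integer, String> e : topN.entrySet()) {
                context.write(new IntWritable(e.getKey()), new Text(e.getValue()));
            }
        }
    }

Configuring the job with job.setNumReduceTasks(1) and applying the same trimming logic in the reducer then yields the global top N.
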
17
votes
4 answers

Out of memory error in Mapreduce shuffle phase

I am getting strange errors while running a wordcount-like MapReduce program. I have a Hadoop cluster with 20 slaves, each having 4 GB RAM. I configured my map tasks to have a heap of 300 MB and my reduce task slots get 1 GB. I have 2 map slots and 1…
DDW
  • 1,975
  • 2
  • 13
  • 26
17
votes
1 answer

How to get the document with max value for a field with map-reduce in pymongo?

How do I find the document with the maximum uid field with map-reduce in pymongo? I have tried the following but it prints out blanks: from pymongo import Connection from bson.code import Code db = Connection().map_reduce_example db.things.insert({…
alvas
  • 115,346
  • 109
  • 446
  • 738