Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
21
votes
9 answers

MapReduce job hangs, waiting for AM container to be allocated

I tried to run a simple word count as a MapReduce job. Everything works fine when run locally (all work done on the Name Node). But when I try to run it on a cluster using YARN (adding mapreduce.framework.name=yarn to mapred-site.conf), the job hangs. I came…
KaP
  • 387
  • 1
  • 2
  • 12
21
votes
1 answer

What is sequence file in hadoop?

I am new to MapReduce and I want to understand what sequence file data input is. I studied the Hadoop book, but it was hard for me to understand.
Soghra Gargari
  • 401
  • 1
  • 4
  • 9
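
For context on the sequence-file question above: a SequenceFile is Hadoop's flat file of binary key/value pairs; it is splittable, optionally compressed, and can be consumed directly by a MapReduce job (via SequenceFileInputFormat) or used to pack many small records into one large file. A minimal write-then-read sketch, assuming a Hadoop 2.x or later client and a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/demo.seq");   // placeholder path for the sketch

    // Write a few (IntWritable, Text) records as binary key/value pairs.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      for (int i = 0; i < 3; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    }

    // Read the records back in the order they were written.
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(path))) {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    }
  }
}
```
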
21
votes
3 answers

Difference between Application Manager and Application Master in YARN?

I understood how MRv1 works. Now I am trying to understand MRv2. What's the difference between the Application Manager and the Application Master in YARN?
hadooper
  • 726
  • 1
  • 6
  • 18
21
votes
3 answers

Concatenate string values in array in a single field in MongoDB

Suppose that I have a series of documents with the following format: { "_id": "3_0", "values": ["1", "2"] } and I would like to obtain a projection of the array's values concatenated in a single field: { "_id": "3_0", "values":…
Eylen
  • 2,617
  • 4
  • 27
  • 42
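
One way to get the projection the question above asks for, sketched with the MongoDB Java driver and the aggregation pipeline rather than mapReduce: fold $concat over the array with $reduce (requires MongoDB 3.4+). The connection string, database, and collection names here are placeholders.

```java
import java.util.Arrays;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class ConcatArrayValues {
  public static void main(String[] args) {
    // Placeholder connection string, database, and collection names.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> coll = client.getDatabase("test").getCollection("docs");

      // {$reduce: {input: "$values", initialValue: "", in: {$concat: ["$$value", "$$this"]}}}
      Document concat = new Document("$reduce", new Document("input", "$values")
          .append("initialValue", "")
          .append("in", new Document("$concat", Arrays.asList("$$value", "$$this"))));

      // Project the array into a single concatenated string; _id is kept by default.
      for (Document doc : coll.aggregate(Arrays.asList(
          new Document("$project", new Document("values", concat))))) {
        System.out.println(doc.toJson());
      }
    }
  }
}
```
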
21
votes
6 answers

MapReduce jobs get stuck in Accepted state

I have my own MapReduce code that I'm trying to run, but it just stays in the Accepted state. I tried running another sample MR job that I'd run previously and which was successful. But now, both jobs stay in the Accepted state. I tried changing various…
user1571307
  • 335
  • 1
  • 2
  • 12
21
votes
5 answers

setup and cleanup methods of Mapper/Reducer in Hadoop MapReduce

Are the setup and cleanup methods called in each mapper and reducer task, respectively? Or are they called only once at the start of the overall mapper and reducer jobs?
kee
  • 10,969
  • 24
  • 107
  • 168
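
For the setup/cleanup question above: in the org.apache.hadoop.mapreduce API these methods run once per task attempt (once for each map task and each reduce task), wrapping that task's many map()/reduce() calls; they are not run once for the whole job. A minimal Mapper sketch (the demo.key.prefix property name is invented for the example):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// setup() and cleanup() run once per task attempt (i.e. once for each map task),
// around the many map() calls that task makes -- not once for the whole job.
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Text outKey;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Runs once, before this task's first map() call: open connections, read
    // values from the Configuration, load side data, etc.
    // "demo.key.prefix" is an invented property name for this sketch.
    outKey = new Text(context.getConfiguration().get("demo.key.prefix", "line"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Runs once per input record.
    context.write(outKey, new IntWritable(value.getLength()));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Runs once, after this task's last map() call: flush and close resources here.
  }
}
```
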
20
votes
3 answers

About Hadoop/HDFS file splitting

I just want to confirm the following. Please verify if this is correct: 1. As per my understanding, when we copy a file into HDFS, that is the point when the file (assuming its size > 64 MB = HDFS block size) is split into multiple chunks, and each chunk is…
sunillp
  • 983
  • 3
  • 13
  • 31
20
votes
3 answers

Passing parameters to map function in Hadoop

I am new to Hadoop. I want to access a command-line argument from the main function (Java program) inside the map function of the mapper class. Please suggest ways to do this.
Pooja N Babu
  • 347
  • 3
  • 5
  • 15
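
One common way to do what the question above asks, sketched here with invented names (the demo.search.pattern property and the ParamDemo/FilterMapper classes are placeholders): set the value on the job Configuration in the driver before the Job is created, then read it back in the mapper's setup() via context.getConfiguration().

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParamDemo {

  public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private String pattern;

    @Override
    protected void setup(Context context) {
      // Read the value the driver stored in the job Configuration.
      // "demo.search.pattern" is an invented property name for this sketch.
      pattern = context.getConfiguration().get("demo.search.pattern", "");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (value.toString().contains(pattern)) {
        context.write(value, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = input dir, args[1] = output dir, args[2] = the parameter to pass.
    Configuration conf = new Configuration();
    conf.set("demo.search.pattern", args[2]);   // must be set before Job.getInstance(conf)

    Job job = Job.getInstance(conf, "parameter demo");
    job.setJarByClass(ParamDemo.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);                   // map-only job, for brevity
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
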
20
votes
1 answer

CouchDB Reduce Check Box in Futon

I created a small test database in CouchDB and I'm creating a temporary view in Futon. I wrote the mapper and the reducer. The mapper works, but the check box for the reducer never shows up. I know that there should be a check box because I've seen…
Jason Marcell
  • 2,785
  • 5
  • 28
  • 41
20
votes
3 answers

Querying embedded objects in Mongoid/rails 3 ("Lower than", Min operators and sorting)

I am using rails 3 with mongoid. I have a collection of Stocks with an embedded collection of Prices : class Stock include Mongoid::Document field :name, :type => String field :code, :type => Integer embeds_many :prices class Price …
mathieurip
  • 547
  • 1
  • 6
  • 16
20
votes
3 answers

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 hashes in MongoDB. I'd like to find all duplicates. The md5 field is indexed. Do you know any fast way to do that using map reduce? Or should I just iterate over all records and check for duplicates manually? My current…
Piotr Czapla
  • 25,734
  • 24
  • 99
  • 122
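
For the duplicate-md5 question above, a sketch using the aggregation pipeline rather than mapReduce: group on the md5 value, count each group, and keep the groups with a count greater than 1. The connection string, database, and collection names are placeholders; allowDiskUse(true) is shown because large groupings can otherwise exceed the aggregation memory limit.

```java
import java.util.Arrays;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class FindDuplicateMd5 {
  public static void main(String[] args) {
    // Placeholder connection string, database, and collection names.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> coll = client.getDatabase("test").getCollection("files");

      // Group on the md5 value, count each group, keep only groups seen more than once.
      Iterable<Document> duplicates = coll.aggregate(Arrays.asList(
          new Document("$group", new Document("_id", "$md5")
              .append("count", new Document("$sum", 1))),
          new Document("$match", new Document("count", new Document("$gt", 1)))
      )).allowDiskUse(true);   // let large groupings spill to disk instead of failing

      for (Document dup : duplicates) {
        System.out.println(dup.toJson());
      }
    }
  }
}
```
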
20
votes
1 answer

Using MongoDB's map/reduce to "group by" two fields

I need something slightly more complex than the examples in the MongoDB docs and I can't seem to be able to wrap my head around it. Say I have a collection of objects of the form {date: "2010-10-10", type: "EVENT_TYPE_1", user_id: 123, ...} Now I…
ibz
  • 44,461
  • 24
  • 70
  • 86
20
votes
3 answers

MongoDB allowDiskUse not working..

Experts: I'm new to MongoDB, but I know enough to get myself in trouble. Case in point: db.test.aggregate( [ {$group: {_id: {email: "$email", gender: "$gender"}, cnt: {$sum: 1}}}, {$group: {_id: "$_id.email", cnt: {$sum: 1}}}, …
Eyal Zinder
  • 614
  • 1
  • 8
  • 21
20
votes
6 answers

Pig vs Hive vs Native Map Reduce

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig, or native map reduce. I went through a few articles which basically point out that Hive is for structured processing…
Maverick
  • 484
  • 2
  • 9
  • 20
20
votes
7 answers

Hadoop input split size vs block size

I am going through the Hadoop definitive guide, where it clearly explains input splits. It says that input splits don't contain actual data; rather, they have the storage locations of the data on HDFS, and that usually the size of an input split is the same as…
rohith
  • 733
  • 4
  • 10
  • 24
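
Tying into the split-size question above: the HDFS block size is a storage property of the file, while the input split size is computed per job by the InputFormat, and by default a split usually covers one block. A hedged sketch showing where the split size can be overridden (newer org.apache.hadoop.mapreduce API; the default identity Mapper and Reducer are used so the example stays short):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split size demo");
    job.setJarByClass(SplitSizeDemo.class);
    job.setInputFormatClass(TextInputFormat.class);

    // By default an input split usually covers one HDFS block. Forcing a
    // smaller maximum split size makes FileInputFormat produce more (and
    // smaller) splits, hence more map tasks, without touching the block size.
    FileInputFormat.setMinInputSplitSize(job, 1L);
    FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);  // 32 MB

    // No mapper or reducer is set, so Hadoop's identity Mapper/Reducer pass
    // records straight through; the point here is only the split-size knobs.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```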