Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets, applicable to certain kinds of distributable problems and designed to run across a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice the parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server could handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
80 votes, 15 answers

Count lines in large files

I commonly work with text files of ~20 GB size and I find myself counting the number of lines in a given file very often. The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that'd be much faster? I work…
Dnaiel
  • 7,622
  • 23
  • 67
  • 126
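
One common approach to the question above (a sketch, not taken from the question itself) is to skip line-by-line decoding entirely and count newline bytes from a large buffered read; the file name comes from the command line and the buffer size is an arbitrary choice.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class LineCount {
        public static void main(String[] args) throws IOException {
            long lines = 0;
            byte[] buf = new byte[1 << 20];                 // 1 MiB read buffer
            try (InputStream in = new FileInputStream(args[0])) {
                int n;
                while ((n = in.read(buf)) > 0) {
                    for (int i = 0; i < n; i++) {
                        if (buf[i] == '\n') {               // count newline bytes, like wc -l
                            lines++;
                        }
                    }
                }
            }
            System.out.println(lines);
        }
    }
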
79 votes, 2 answers

Hadoop truncated/inconsistent counter name

For now, I have a Hadoop job which creates counters with a pretty big name. For example, the following one:…
mr.nothing
  • 5,141
  • 10
  • 53
  • 77
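
For context, counters are normally created and incremented from inside a task roughly as below (the group and counter names here are invented for illustration); Hadoop limits how long counter and group names may be, which is what produces the truncation the question describes.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that tracks malformed records with a custom counter.
    public class ValidatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().isEmpty()) {
                // Group/counter names are free-form strings, but their length is capped.
                context.getCounter("RecordQuality", "EMPTY_INPUT_LINES").increment(1);
                return;
            }
            context.write(value, NullWritable.get());
        }
    }
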
79 votes, 11 answers

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask while trying to make a copy of a partitioned table using the commands in the Hive console: CREATE TABLE copy_table_name LIKE table_name; INSERT…
nickponline
  • 25,354
  • 32
  • 99
  • 167
77 votes, 10 answers

merge output files after reduce phase

In MapReduce, each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is a partition ID associated with the reduce task. Does map/reduce merge these files? If yes, how?
Shahryar
  • 1,454
  • 2
  • 15
  • 32
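
The framework itself leaves one part-r-nnnnn file per reducer; merging is left to the caller (hadoop fs -getmerge does it from the command line). Below is a hedged sketch of doing the same merge with the HDFS client API; the input and output paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class MergeReducerOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path jobOutput = new Path("/user/example/job-output");   // placeholder directory
            Path merged = new Path("/user/example/merged.txt");      // placeholder target file

            try (FSDataOutputStream out = fs.create(merged)) {
                // Concatenate every part-r-* file the reducers wrote, in listing order.
                FileStatus[] parts = fs.globStatus(new Path(jobOutput, "part-r-*"));
                if (parts != null) {
                    for (FileStatus part : parts) {
                        try (FSDataInputStream in = fs.open(part.getPath())) {
                            IOUtils.copyBytes(in, out, conf, false);
                        }
                    }
                }
            }
        }
    }
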
71 votes, 6 answers

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with…
yoni
  • 5,686
  • 3
  • 27
  • 28
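
One minimal pattern for the question above, sketched under the assumption that a HiveServer2 instance is already reachable at localhost:10000 (the host, database, and table name are placeholders), is a plain JUnit test that talks to Hive over JDBC and asserts on the result set.

    import static org.junit.Assert.assertTrue;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.junit.Test;

    // Integration test assuming a locally running HiveServer2.
    public class HiveQueryIT {

        @Test
        public void aggregationQueryReturnsACount() throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement();
                 // A count(*) over a real table typically makes Hive plan a MapReduce stage.
                 ResultSet rs = stmt.executeQuery("SELECT count(*) FROM some_table")) {
                assertTrue(rs.next());
                assertTrue(rs.getLong(1) >= 0);
            }
        }
    }
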
62 votes, 5 answers

Find all duplicate documents in a MongoDB collection by a key field

Suppose I have a collection with some set of documents, something like this: { "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"} { "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"} { "_id" :…
frazman
  • 32,081
  • 75
  • 184
  • 269
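
With a modern MongoDB Java driver, one hedged way to surface duplicates (the field, database, collection, and connection string below are invented to match the question's sample documents) is to $group on the candidate key and keep only groups with more than one member.

    import java.util.Arrays;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class FindDuplicates {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                        client.getDatabase("test").getCollection("items");

                // Group by "name", collect the _ids, keep groups with more than one member.
                for (Document dup : coll.aggregate(Arrays.asList(
                        Aggregates.group("$name",
                                Accumulators.sum("count", 1),
                                Accumulators.push("ids", "$_id")),
                        Aggregates.match(Filters.gt("count", 1))))) {
                    System.out.println(dup.toJson());
                }
            }
        }
    }
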
57 votes, 2 answers

Is the MongoDB aggregation framework faster than map/reduce?

Does the aggregation framework introduced in MongoDB 2.2 have any special performance improvements over map/reduce? If yes, why, how, and by how much? (I have already run a test myself, and the performance was nearly the same.)
Taha Jahangir
  • 4,774
  • 2
  • 42
  • 49
55 votes, 5 answers

Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

I want to debug a MapReduce script and, without going into much trouble, tried to put some print statements in my program. But I can't seem to find them in any of the logs.
jason
  • 3,471
  • 6
  • 30
  • 43
52 votes, 10 answers

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something…
TravisJ
  • 1,592
  • 1
  • 21
  • 37
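
In the Java Spark API the usual answer to the question above is that grouping, not reducing, is the operation being asked for; here is a hedged sketch, with the key/value types and sample data invented purely for illustration.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class KeyToListExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("key-to-list").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                        new Tuple2<>("k", 1), new Tuple2<>("k", 2), new Tuple2<>("k", 3)));

                // groupByKey turns (K, V1), (K, V2), ... into (K, [V1, V2, ...]).
                pairs.groupByKey()
                     .collect()
                     .forEach(kv -> System.out.println(kv._1() + " -> " + kv._2()));
            }
        }
    }
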
52 votes, 2 answers

Difference between Fork/Join and Map/Reduce

What is the key difference between Fork/Join and Map/Reduce? Do they differ in the kind of decomposition and distribution (data vs. computation)?
hotzen
  • 2,800
  • 1
  • 28
  • 42
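
To make the contrast in the question above concrete, here is a minimal Fork/Join example in plain Java (the summing task is invented for illustration): the recursive splitting happens inside one JVM's thread pool, whereas MapReduce distributes the map and reduce phases across machines.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Recursively splits an array sum into sub-tasks, then joins the partial results.
    public class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 10_000;
        private final long[] data;
        private final int from, to;

        public SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) / 2;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                              // run the left half asynchronously
            return right.compute() + left.join();     // compute the right half, then join
        }

        public static void main(String[] args) {
            long[] data = new long[1_000_000];
            java.util.Arrays.fill(data, 1L);
            long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
            System.out.println(total);                // prints 1000000
        }
    }
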
51 votes, 8 answers

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64MB, while the block size on disk is generally 4KB. What does a 64MB block size mean? Does it mean that the smallest unit of reading from disk is 64MB? If yes, what is the advantage of doing that? easy…
dykw
  • 1,199
  • 3
  • 13
  • 17
49 votes, 10 answers

Simple Java Map/Reduce framework

Can anyone point me at a simple, open-source Map/Reduce framework/API for Java? There doesn't seem to be much evidence of such a thing existing, but someone else might know different. The best I can find is, of course, Hadoop MapReduce, but that fails…
skaffman
  • 398,947
  • 96
  • 818
  • 769
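
For datasets that fit on one machine, plain Java streams already give a small-scale map/reduce shape; the sketch below only illustrates that idea (the sample data is made up) and is not a recommendation of any particular library: map each record to a key, group, then reduce each group.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class InMemoryWordCount {
        public static void main(String[] args) {
            List<String> lines = Arrays.asList("to be or not to be", "to map or to reduce");

            // "map": split lines into words; "shuffle": group by word; "reduce": count per group.
            Map<String, Long> counts = lines.stream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

            counts.forEach((word, n) -> System.out.println(word + "\t" + n));
        }
    }
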
49 votes, 3 answers

Check if every element in array matches condition

I have a collection of documents: date: Date users: [ { user: 1, group: 1 } { user: 5, group: 2 } ] date: Date users: [ { user: 1, group: 1 } { user: 3, group: 2 } ] I would like to query against this collection to find all documents where…
Wex
  • 15,539
  • 10
  • 64
  • 107
49 votes, 1 answer

Is gzip format supported in Spark?

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS. However, in the official documentation, I can't find any hint as to how…
ptikobj
  • 2,690
  • 7
  • 39
  • 64
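
For what it's worth, Spark reads gzip input through the same Hadoop input formats as plain text, so a hedged sketch in the Java API is simply pointing textFile at the .gz files (the path below is a placeholder); note that a gzip file is not splittable, so each file becomes a single partition unless repartitioned.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadGzip {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("read-gzip").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Gzip'd text files are decompressed transparently; each .gz file maps to one partition.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/logs/*.gz");
                System.out.println(lines.count());
            }
        }
    }
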
48 votes, 9 answers

What is a container in YARN?

What is a container in YARN? Is it the same as the child JVM in which the tasks on the NodeManager run, or is it different?
rahul
  • 1,423
  • 3
  • 18
  • 28