Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets, applicable to certain kinds of distributable problems and designed to run across a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice the parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server could handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
80 votes, 15 answers

Count lines in large files

I commonly work with text files of ~20 GB size and I find myself counting the number of lines in a given file very often. The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that'd be much faster? I work…
Dnaiel
  • 7,622
  • 23
  • 67
  • 126
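
One common approach to the question above (a sketch, not taken from the question itself) is to skip line-by-line decoding entirely and count newline bytes from a large buffered read; the file name comes from the command line and the buffer size is an arbitrary choice.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class LineCount {
        public static void main(String[] args) throws IOException {
            long lines = 0;
            byte[] buf = new byte[1 << 20];                 // 1 MiB read buffer
            try (InputStream in = new FileInputStream(args[0])) {
                int n;
                while ((n = in.read(buf)) > 0) {
                    for (int i = 0; i < n; i++) {
                        if (buf[i] == '\n') {               // count newline bytes, like wc -l
                            lines++;
                        }
                    }
                }
            }
            System.out.println(lines);
        }
    }
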
79 votes, 2 answers

Hadoop truncated/inconsistent counter name

For now, I have a Hadoop job which creates counters with a pretty big name. For example, the following one:…
mr.nothing
  • 5,141
  • 10
  • 53
  • 77
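
For context, counters are normally created and incremented from inside a task roughly as below (the group and counter names here are invented for illustration); Hadoop limits how long counter and group names may be, which is what produces the truncation the question describes.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that tracks malformed records with a custom counter.
    public class ValidatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().isEmpty()) {
                // Group/counter names are free-form strings, but their length is capped.
                context.getCounter("RecordQuality", "EMPTY_INPUT_LINES").increment(1);
                return;
            }
            context.write(value, NullWritable.get());
        }
    }
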
79 votes, 11 answers

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask while trying to make a copy of a partitioned table using the commands in the Hive console: CREATE TABLE copy_table_name LIKE table_name; INSERT…
nickponline
  • 25,354
  • 32
  • 99
  • 167
77 votes, 10 answers

merge output files after reduce phase

In MapReduce, each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is a partition ID associated with the reduce task. Does map/reduce merge these files? If yes, how?
Shahryar
  • 1,454
  • 2
  • 15
  • 32
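
The framework itself leaves one part-r-nnnnn file per reducer; merging is left to the caller (hadoop fs -getmerge does it from the command line). Below is a hedged sketch of doing the same merge with the HDFS client API; the input and output paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class MergeReducerOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path jobOutput = new Path("/user/example/job-output");   // placeholder directory
            Path merged = new Path("/user/example/merged.txt");      // placeholder target file

            try (FSDataOutputStream out = fs.create(merged)) {
                // Concatenate every part-r-* file the reducers wrote, in listing order.
                FileStatus[] parts = fs.globStatus(new Path(jobOutput, "part-r-*"));
                if (parts != null) {
                    for (FileStatus part : parts) {
                        try (FSDataInputStream in = fs.open(part.getPath())) {
                            IOUtils.copyBytes(in, out, conf, false);
                        }
                    }
                }
            }
        }
    }
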
71 votes, 6 answers

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with…
yoni
  • 5,686
  • 3
  • 27
  • 28
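
One minimal pattern for the question above, sketched under the assumption that a HiveServer2 instance is already reachable at localhost:10000 (the host, database, and table name are placeholders), is a plain JUnit test that talks to Hive over JDBC and asserts on the result set.

    import static org.junit.Assert.assertTrue;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.junit.Test;

    // Integration test assuming a locally running HiveServer2.
    public class HiveQueryIT {

        @Test
        public void aggregationQueryReturnsACount() throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement();
                 // A count(*) over a real table typically makes Hive plan a MapReduce stage.
                 ResultSet rs = stmt.executeQuery("SELECT count(*) FROM some_table")) {
                assertTrue(rs.next());
                assertTrue(rs.getLong(1) >= 0);
            }
        }
    }
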
62 votes, 5 answers

Find all duplicate documents in a MongoDB collection by a key field

Suppose I have a collection with some set of documents, something like this: { "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"} { "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"} { "_id" :…
frazman
  • 32,081
  • 75
  • 184
  • 269
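
With a modern MongoDB Java driver, one hedged way to surface duplicates (the field, database, collection, and connection string below are invented to match the question's sample documents) is to $group on the candidate key and keep only groups with more than one member.

    import java.util.Arrays;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class FindDuplicates {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                        client.getDatabase("test").getCollection("items");

                // Group by "name", collect the _ids, keep groups with more than one member.
                for (Document dup : coll.aggregate(Arrays.asList(
                        Aggregates.group("$name",
                                Accumulators.sum("count", 1),
                                Accumulators.push("ids", "$_id")),
                        Aggregates.match(Filters.gt("count", 1))))) {
                    System.out.println(dup.toJson());
                }
            }
        }
    }
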
57 votes, 2 answers

Is the MongoDB aggregation framework faster than map/reduce?

Does the aggregation framework introduced in MongoDB 2.2 have any special performance improvements over map/reduce? If yes, why, how, and by how much? (I have already run a test myself, and the performance was nearly the same.)
Taha Jahangir
  • 4,774
  • 2
  • 42
  • 49
55 votes, 5 answers

Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

I want to debug a MapReduce script and, without going into much trouble, tried to put some print statements in my program. But I can't seem to find them in any of the logs.
jason
  • 3,471
  • 6
  • 30
  • 43
52 votes, 10 answers

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something…
TravisJ
  • 1,592
  • 1
  • 21
  • 37
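
In the Java Spark API the usual answer to the question above is that grouping, not reducing, is the operation being asked for; here is a hedged sketch, with the key/value types and sample data invented purely for illustration.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class KeyToListExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("key-to-list").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                        new Tuple2<>("k", 1), new Tuple2<>("k", 2), new Tuple2<>("k", 3)));

                // groupByKey turns (K, V1), (K, V2), ... into (K, [V1, V2, ...]).
                pairs.groupByKey()
                     .collect()
                     .forEach(kv -> System.out.println(kv._1() + " -> " + kv._2()));
            }
        }
    }
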
52 votes, 2 answers

Difference between Fork/Join and Map/Reduce

What is the key difference between Fork/Join and Map/Reduce? Do they differ in the kind of decomposition and distribution (data vs. computation)?
hotzen
  • 2,800
  • 1
  • 28
  • 42
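
To make the contrast in the question above concrete, here is a minimal Fork/Join example in plain Java (the summing task is invented for illustration): the recursive splitting happens inside one JVM's thread pool, whereas MapReduce distributes the map and reduce phases across machines.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Recursively splits an array sum into sub-tasks, then joins the partial results.
    public class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 10_000;
        private final long[] data;
        private final int from, to;

        public SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) / 2;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                              // run the left half asynchronously
            return right.compute() + left.join();     // compute the right half, then join
        }

        public static void main(String[] args) {
            long[] data = new long[1_000_000];
            java.util.Arrays.fill(data, 1L);
            long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
            System.out.println(total);                // prints 1000000
        }
    }
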
51 votes, 8 answers

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64MB, while the block size on disk is generally 4KB. What does a 64MB block size mean? Does it mean that the smallest unit of reading from disk is 64MB? If yes, what is the advantage of doing that? easy…
dykw
  • 1,199
  • 3
  • 13
  • 17
49 votes, 10 answers

Simple Java Map/Reduce framework

Can anyone point me at a simple, open-source Map/Reduce framework/API for Java? There doesn't seem to be much evidence of such a thing existing, but someone else might know different. The best I can find is, of course, Hadoop MapReduce, but that fails…
skaffman
  • 398,947
  • 96
  • 818
  • 769
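
For datasets that fit on one machine, plain Java streams already give a small-scale map/reduce shape; the sketch below only illustrates that idea (the sample data is made up) and is not a recommendation of any particular library: map each record to a key, group, then reduce each group.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class InMemoryWordCount {
        public static void main(String[] args) {
            List<String> lines = Arrays.asList("to be or not to be", "to map or to reduce");

            // "map": split lines into words; "shuffle": group by word; "reduce": count per group.
            Map<String, Long> counts = lines.stream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

            counts.forEach((word, n) -> System.out.println(word + "\t" + n));
        }
    }
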
49 votes, 3 answers

Check if every element in array matches condition

I have a collection of documents: date: Date users: [ { user: 1, group: 1 } { user: 5, group: 2 } ] date: Date users: [ { user: 1, group: 1 } { user: 3, group: 2 } ] I would like to query against this collection to find all documents where…
Wex
  • 15,539
  • 10
  • 64
  • 107
49 votes, 1 answer

Is gzip format supported in Spark?

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS. However, in the official documentation, I can't find any hint as to how…
ptikobj
  • 2,690
  • 7
  • 39
  • 64
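
For what it's worth, Spark reads gzip input through the same Hadoop input formats as plain text, so a hedged sketch in the Java API is simply pointing textFile at the .gz files (the path below is a placeholder); note that a gzip file is not splittable, so each file becomes a single partition unless repartitioned.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadGzip {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("read-gzip").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Gzip'd text files are decompressed transparently; each .gz file maps to one partition.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/logs/*.gz");
                System.out.println(lines.count());
            }
        }
    }
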
48 votes, 9 answers

What is a container in YARN?

What is a container in YARN? Is it the same as the child JVM in which the tasks on the NodeManager run, or is it different?
rahul
  • 1,423
  • 3
  • 18
  • 28