Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, suited to certain kinds of distributable problems

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single commodity server could handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
34 votes, 7 answers

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like: db.receipts.findOne() { "_id" : ObjectId("4e57908c7a044a30dc03a888"), …
nelstrom
34 votes, 6 answers

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me. Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows: Write a mapper function that…
Chander Shivdasani
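The standard answer to this question leans on the fact that Hadoop sorts intermediate keys during the shuffle, so the mapper only has to make each integer the key. A minimal sketch with hypothetical class names; a total ordering across multiple reducers would additionally need a TotalOrderPartitioner.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parse each line as an integer and make it the key.
// The shuffle phase sorts keys, so the reducer sees them in order.
class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
    private final IntWritable number = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        number.set(Integer.parseInt(line.toString().trim()));
        context.write(number, NullWritable.get());
    }
}

// Reducer: identity; keys arrive already sorted. With a single reducer the
// output is totally ordered; with several, use a total-order partitioner.
class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        for (NullWritable v : values) {      // loop preserves duplicates
            context.write(key, NullWritable.get());
        }
    }
}
```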
32 votes, 8 answers

What type of problems can MapReduce solve?

Is there a theoretical analysis available which describes what kinds of problems MapReduce can solve?
amit-agrawal
31 votes, 1 answer

Accessing stream output from HDFS in MRjob

I'm trying to use a Python driver to run an iterative MRjob program. The exit criteria depend on a counter. The job itself seems to run. If I run a single iteration from the command line, I can then run hadoop fs -cat /user/myname/myhdfsdir/part-00000…
tony_tiger
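The question concerns MRjob, but the underlying pattern is plain Hadoop: re-submit the job and read a custom counter after each iteration until it signals convergence. A hypothetical Java driver loop, with made-up counter and path names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long updates = Long.MAX_VALUE;
        int iteration = 0;

        // Re-submit the job until the custom counter reports no more updates.
        while (updates > 0) {
            Job job = Job.getInstance(conf, "iteration-" + iteration);
            // ... setJarByClass / setMapperClass / setReducerClass omitted
            FileInputFormat.addInputPath(job, new Path("data/iter" + iteration));
            FileOutputFormat.setOutputPath(job, new Path("data/iter" + (iteration + 1)));
            job.waitForCompletion(true);

            // "CONVERGENCE:UPDATES" is a hypothetical counter that the reducer
            // would increment each time it changes a value.
            updates = job.getCounters()
                         .findCounter("CONVERGENCE", "UPDATES")
                         .getValue();
            iteration++;
        }
    }
}
```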
31 votes, 2 answers

Cognitive Complexity and its effect on the code

With respect to one of our Java projects, we recently started using SonarLint. The output of the code analysis shows too many critical code smell alerts. Critical code smell: Refactor this method to reduce its Cognitive Complexity. I have heard about…
vmorusu
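Cognitive Complexity penalizes nested control flow more heavily than flat statement counts, so the usual remedy is flattening with guard clauses. A hypothetical before/after in Java, unrelated to the asker's actual code:

```java
// Minimal stand-in type so the example compiles.
class User {
    private final boolean active;
    private final String role;
    User(boolean active, String role) { this.active = active; this.role = role; }
    boolean isActive() { return active; }
    boolean hasRole(String r) { return r.equals(role); }
}

class AccessCheck {
    // Before: three levels of nesting push Cognitive Complexity up fast.
    static String access(User user) {
        if (user != null) {
            if (user.isActive()) {
                if (user.hasRole("admin")) {
                    return "granted";
                } else {
                    return "denied";
                }
            } else {
                return "inactive";
            }
        } else {
            return "unknown";
        }
    }

    // After: guard clauses remove the nesting; each condition reads linearly.
    static String accessRefactored(User user) {
        if (user == null) return "unknown";
        if (!user.isActive()) return "inactive";
        return user.hasRole("admin") ? "granted" : "denied";
    }
}
```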
31 votes, 7 answers

20 Billion Rows/Month - Hbase / Hive / Greenplum / What?

I'd like to use your wisdom to pick the right solution for a data-warehouse system. Here are some details to better understand the problem: Data is organized in a star schema structure with one BIG fact table and ~15 dimensions. 20B fact rows…
Haggai
31 votes, 5 answers

What are some scenarios for which MPI is a better fit than MapReduce?

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate. In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of…
Igor ostrovsky
31 votes, 1 answer

Hadoop speculative task execution

In Google's MapReduce paper, they describe backup tasks; I think this is the same thing as speculative tasks in Hadoop. How is a speculative task implemented? When I start a speculative task, does the task start from the very beginning as the older and…
lil
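For context: in Hadoop, a speculative attempt starts from the beginning on the same input split rather than resuming the slow attempt's progress, and whichever attempt finishes first wins while the other is killed. Speculation can be toggled per job; a minimal sketch using the Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow duplicate (speculative) attempts for straggling map tasks...
        conf.setBoolean("mapreduce.map.speculative", true);
        // ...but not for reduce tasks, where re-running is more expensive.
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "speculative-demo");
        // ... remaining job setup omitted
    }
}
```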
31 votes, 6 answers

No such method exception Hadoop

When I am running a Hadoop .jar file from the command prompt, it throws an exception saying there is no such method for StockKey. StockKey is my custom class defined for my own type of key. Here is the exception: 12/07/12 00:18:47 INFO mapred.JobClient:…
London guy
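A frequent cause of this error with custom key types is a missing public no-argument constructor: Hadoop instantiates keys reflectively during deserialization. A sketch of a well-formed key; the fields chosen for StockKey here are invented for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key. Hadoop creates key instances via reflection,
// so the public no-arg constructor is mandatory.
public class StockKey implements WritableComparable<StockKey> {
    private String symbol = "";
    private long timestamp;

    public StockKey() { }                       // required by Hadoop

    public StockKey(String symbol, long timestamp) {
        this.symbol = symbol;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(StockKey other) {
        int c = symbol.compareTo(other.symbol);
        return c != 0 ? c : Long.compare(timestamp, other.timestamp);
    }

    // equals() and hashCode() should also be overridden (hashCode drives
    // the default HashPartitioner); omitted here for brevity.
}
```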
30 votes, 4 answers

What is the purpose of "uber mode" in hadoop?

Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is, and the more I searched the more confused I got. Can anybody please help me by answering my questions? What does uber mode do? Does it work differently in…
Mohammed Asad
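In short, uber mode lets YARN run all tasks of a sufficiently small job inside the MRApplicationMaster's own JVM, avoiding the overhead of scheduling separate containers. A minimal sketch of the relevant switches (Hadoop 2.x property names; the thresholds shown are the usual defaults):

```java
import org.apache.hadoop.conf.Configuration;

public class UberModeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Let small jobs run entirely inside the MRAppMaster JVM.
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        // Thresholds a job must stay under to qualify as "uber":
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
        // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size.
    }
}
```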
29 votes, 4 answers

Change File Split size in Hadoop

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge. That is, a 64 MB file, which is the default split size for TextInputFormat, would take even…
Ahmadov
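One relevant knob, sketched below: capping the maximum split size makes a single file yield several splits, and therefore several mappers, provided the input format is splittable. The job name and size here are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-splits");
        // Cap each split at 16 MB so a 64 MB file becomes four splits,
        // giving four map tasks instead of one.
        FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024L);
        // Equivalent property: mapreduce.input.fileinputformat.split.maxsize
    }
}
```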
29 votes, 1 answer

Map-Reduce performance in MongoDb 2.2, 2.4, and 2.6

I've found this discussion: MongoDB: Terrible MapReduce Performance. Basically it says to avoid Mongo's MR queries, as MR is single-threaded and not meant for real-time use at all. Two years have passed, and I wonder what has changed since…
YMC
28 votes, 8 answers

MapReduce implementation in Scala

I'd like to find a good, robust MapReduce framework that can be used from Scala.
Roman Kagan
28 votes, 2 answers

Rolling your own reduceByKey in Spark Dataset

I'm trying to learn to use DataFrames and DataSets more in addition to RDDs. For an RDD, I know I can do someRDD.reduceByKey((x,y) => x + y), but I don't see that function for Dataset. So I decided to write one. someRdd.map(x =>…
Carlos Bribiescas
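The closest built-in on a Dataset is groupByKey followed by reduceGroups. A hedged Java sketch with invented sample data; note that reduceGroups returns a Dataset of (key, value) tuples:

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class ReduceByKeyOnDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("reduceByKey-on-Dataset")
                .master("local[*]")
                .getOrCreate();

        // Invented sample data: (word, count) pairs.
        Dataset<Tuple2<String, Integer>> pairs = spark.createDataset(
                Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)),
                Encoders.tuple(Encoders.STRING(), Encoders.INT()));

        // groupByKey + reduceGroups plays the role of RDD.reduceByKey.
        Dataset<Tuple2<String, Tuple2<String, Integer>>> reduced = pairs
                .groupByKey((MapFunction<Tuple2<String, Integer>, String>) t -> t._1(),
                            Encoders.STRING())
                .reduceGroups((ReduceFunction<Tuple2<String, Integer>>) (a, b) ->
                        new Tuple2<>(a._1(), a._2() + b._2()));

        reduced.show();
        spark.stop();
    }
}
```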
28 votes, 10 answers

IllegalAccessError to guava's StopWatch from org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus

I'm trying to run a small Spark application and am getting the following exception: Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class…
Lika