Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase: all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle; a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, provided the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
28
votes
8 answers

Large scale Machine Learning

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian…
user387263
  • 291
  • 3
  • 5
28
votes
6 answers

array_reduce() can't work as associative-array "reducer" for PHP?

I have an associative array $assoc, and need to reduce it to a string, in this context $OUT = "$v) $OUT.= " $k=\"$v\""; $OUT.= '/>'; How can I do the same thing in an elegant way, but using array_reduce()? Near the…
Peter Krauss
  • 13,174
  • 24
  • 167
  • 304
28
votes
2 answers

MapReduce or Spark?

I have tested Hadoop and MapReduce with Cloudera and I found it pretty cool; I thought it was the most recent and relevant Big Data solution. But a few days ago, I found this: https://spark.incubator.apache.org/ A "Lightning fast cluster computing…
Nosk
  • 753
  • 2
  • 6
  • 24
28
votes
3 answers

Advantages of using NullWritable in Hadoop

What are the advantages of using NullWritable for null keys/values over using null texts (i.e. new Text(null))? I see the following from the "Hadoop: The Definitive Guide" book. NullWritable is a special type of Writable, as it has a zero-length…
Venk K
  • 1,157
  • 5
  • 14
  • 25
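
On the NullWritable question above: NullWritable serializes to zero bytes, so no key data is written or shuffled at all, whereas an empty Text still writes a length header for every record. A minimal sketch of a values-only mapper; the class name and job wiring are assumptions, not code from the question.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits NullWritable instead of an empty Text key, so only the values
// contribute bytes to the map output and the final files.
public class ValuesOnlyMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // NullWritable is a singleton; get() returns the shared instance.
        context.write(NullWritable.get(), value);
    }
}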
28
votes
9 answers

Writing to HDFS could only be replicated to 0 nodes instead of minReplication (=1)

I have 3 data nodes running, and while running a job I am getting the error given below: java.io.IOException: File /user/ashsshar/olhcache/loaderMap9b663bd9 could only be replicated to 0 nodes instead of minReplication (=1). There are 3…
Ashish Sharma
  • 1,597
  • 7
  • 24
  • 35
27
votes
4 answers

Using map/reduce for mapping the properties in a collection

Update: follow-up to MongoDB Get names of all keys in collection. As pointed out by Kristina, one can use Mongodb 's map/reduce to list the keys in a collection: db.things.insert( { type : ['dog', 'cat'] } ); db.things.insert( { egg : ['cat'] }…
Andrea Fiore
  • 1,628
  • 2
  • 14
  • 18
26
votes
5 answers

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL. Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of…
Niels Basjes
  • 10,424
  • 9
  • 50
  • 66
26
votes
3 answers

Large Block Size in HDFS! How is the unused space accounted for?

We all know that the block size in HDFS is pretty large (64M or 128M) as compared to the block size in traditional file systems. This is done in order to reduce the percentage of seek time compared to the transfer time (Improvements in transfer rate…
Abhishek Jain
  • 4,478
  • 8
  • 34
  • 51
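
The seek-versus-transfer trade-off behind large HDFS blocks can be made concrete with a little arithmetic. The 10 ms seek and 100 MB/s transfer figures below are assumed for illustration, not taken from the question.

public class BlockSeekOverhead {
    public static void main(String[] args) {
        double seekMs = 10.0;            // assumed average disk seek time
        double transferMBPerSec = 100.0; // assumed sustained transfer rate

        for (long blockMB : new long[] {4, 64, 128}) {
            double transferMs = blockMB / transferMBPerSec * 1000.0;
            System.out.printf("%3d MB block: seek = %5.2f%% of transfer time%n",
                    blockMB, 100.0 * seekMs / transferMs);
        }
        // With these numbers: 4 MB -> 25%, 64 MB -> 1.56%, 128 MB -> 0.78%,
        // which is why HDFS favours large blocks for sequential reads.
    }
}

As for the unused space in the title: HDFS does not pad blocks, so a file smaller than a block consumes only its actual size on the datanode's local disk; the block size is a unit of metadata and scheduling, not a preallocation.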
26
votes
2 answers

Hadoop : java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

My program looks like public class TopKRecord extends Configured implements Tool { public static class MapClass extends Mapper { public void map(Text key, Text value, Context context) throws IOException,…
daydreamer
  • 87,243
  • 191
  • 450
  • 722
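
On the ClassCastException above: with the default TextInputFormat, the framework hands the mapper a LongWritable byte offset as the input key, not Text, so declaring the map input key as Text fails at runtime. A sketch of a signature that matches the contract; the mapper name and body are illustrative, since the original TopKRecord code is truncated in the excerpt.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetAwareMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself.
        // Declaring the input key as Text instead is what triggers
        // "LongWritable cannot be cast to Text" at runtime.
        context.write(new Text("line"), value);
    }
}

If the input really is tab-separated key/value text, switching the job to KeyValueTextInputFormat delivers Text keys and makes the original Text-keyed signature legitimate.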
26
votes
4 answers

hadoop: difference between 0 reducer and identity reducer?

I am just trying to confirm my understanding of the difference between 0 reducer and identity reducer. 0 reducer means the reduce step will be skipped and the mapper output will be the final output. Identity reducer means shuffling/sorting will still take…
kee
  • 10,969
  • 24
  • 107
  • 168
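
The distinction drawn in the question above maps onto two one-line job settings. A minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API; the job names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducerModes {
    public static void main(String[] args) throws Exception {
        // Map-only job: no shuffle, no sort; mapper output is written
        // directly to HDFS as the final result.
        Job mapOnly = Job.getInstance(new Configuration(), "map-only");
        mapOnly.setNumReduceTasks(0);

        // Identity reduction: the full shuffle and sort still run, and the
        // output is the mapper output grouped and sorted by key.
        Job identity = Job.getInstance(new Configuration(), "identity-reduce");
        identity.setReducerClass(Reducer.class); // base Reducer passes pairs through
    }
}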
25
votes
4 answers

Hive unable to manually set number of reducers

I have the following hive query: select count(distinct id) as total from mytable; which automatically spawns: 1408 Mappers 1 Reducer I need to manually set the number of reducers and I have tried the following: set mapred.reduce.tasks=50 set…
magicalo
  • 463
  • 2
  • 5
  • 12
25
votes
3 answers

Hadoop namenode : Single point of failure

The Namenode in the Hadoop architecture is a single point of failure. How do people who have large Hadoop clusters cope with this problem? Is there an industry-accepted solution that has worked well wherein a secondary Namenode takes over in case…
rakeshr
  • 1,027
  • 3
  • 17
  • 25
25
votes
5 answers

Gradle Transitive dependency exclusion is not working as expected. (How do I get rid of com.google.guava:guava-jdk5:13.0 ?)

Here is a snippet of my build.gradle: compile 'com.google.api-client:google-api-client:1.19.0' compile 'com.google.apis:google-api-services-oauth2:v2-rev77-1.19.0' compile 'com.google.apis:google-api-services-plus:v1-rev155-1.19.0' compile…
unify
  • 6,161
  • 4
  • 33
  • 34
25
votes
7 answers

How to specify AWS Access Key ID and Secret Access Key as part of an Amazon s3n URL

I am passing input and output folders as parameters to a MapReduce word count program from a webpage. I am getting the error below: HTTP Status 500 - Request processing failed; nested exception is java.lang.IllegalArgumentException: AWS Access Key ID and…
user3795951
  • 321
  • 2
  • 5
  • 7
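
One common alternative to embedding the credentials in the URL is setting them on the job configuration. A sketch using the classic s3n property names, with placeholder values:

import org.apache.hadoop.conf.Configuration;

public class S3nCredentials {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Classic s3n credential properties; never hard-code real keys.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
        // The input path can then be a plain s3n://bucket/path URL with
        // no credentials embedded in it.
    }
}

This also sidesteps a known pitfall of the s3n://ID:SECRET@bucket form, where a secret key containing a slash breaks URL parsing.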
25
votes
4 answers

Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

I am trying to run a map/reduce job in Java. Below are my files: WordCount.java package counter; public class WordCount extends Configured implements Tool { public int run(String[] arg0) throws Exception { Configuration conf = new…
Neil
  • 1,715
  • 6
  • 30
  • 45
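
The type mismatch above usually means the map output key class declared on the job disagrees with what the mapper actually emits; a frequent trigger is never calling job.setMapperClass, which leaves the default identity mapper forwarding the LongWritable offset as the key. A minimal sketch of consistent wiring, reusing the illustrative TokenizerMapper from the word-count sketch at the top of this page (assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "type wiring");
        // Without this line the default identity mapper runs and emits
        // (LongWritable offset, Text line), clashing with the Text key
        // declared below and producing the "Type mismatch in key" error.
        job.setMapperClass(WordCount.TokenizerMapper.class); // emits (Text, IntWritable)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
    }
}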