Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase: all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle; a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, provided the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
28
votes
8 answers

Large scale Machine Learning

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian…
user387263
  • 291
  • 3
  • 5
28
votes
6 answers

array_reduce() can't work as associative-array "reducer" for PHP?

I have an associative array $assoc, and need to reduce it to a string, in this context $OUT = "$v) $OUT.= " $k=\"$v\""; $OUT.= '/>'; How can I do the same thing in an elegant way, but using array_reduce()? Near the…
Peter Krauss
  • 13,174
  • 24
  • 167
  • 304
28
votes
2 answers

MapReduce or Spark?

I have tested Hadoop and MapReduce with Cloudera and I found it pretty cool; I thought it was the most recent and relevant Big Data solution. But a few days ago, I found this: https://spark.incubator.apache.org/ A "Lightning fast cluster computing…
Nosk
  • 753
  • 2
  • 6
  • 24
28
votes
3 answers

Advantages of using NullWritable in Hadoop

What are the advantages of using NullWritable for null keys/values over using null texts (i.e. new Text(null))? I see the following from the "Hadoop: The Definitive Guide" book. NullWritable is a special type of Writable, as it has a zero-length…
Venk K
  • 1,157
  • 5
  • 14
  • 25
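
On the NullWritable question above: NullWritable serializes to zero bytes, so no key data is written or shuffled at all, whereas an empty Text still writes a length header for every record. A minimal sketch of a values-only mapper; the class name and job wiring are assumptions, not code from the question.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits NullWritable instead of an empty Text key, so only the values
// contribute bytes to the map output and the final files.
public class ValuesOnlyMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // NullWritable is a singleton; get() returns the shared instance.
        context.write(NullWritable.get(), value);
    }
}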
28
votes
9 answers

Writing to HDFS could only be replicated to 0 nodes instead of minReplication (=1)

I have 3 data nodes running, and while running a job I am getting the error given below: java.io.IOException: File /user/ashsshar/olhcache/loaderMap9b663bd9 could only be replicated to 0 nodes instead of minReplication (=1). There are 3…
Ashish Sharma
  • 1,597
  • 7
  • 24
  • 35
27
votes
4 answers

Using map/reduce for mapping the properties in a collection

Update: follow-up to MongoDB Get names of all keys in collection. As pointed out by Kristina, one can use Mongodb 's map/reduce to list the keys in a collection: db.things.insert( { type : ['dog', 'cat'] } ); db.things.insert( { egg : ['cat'] }…
Andrea Fiore
  • 1,628
  • 2
  • 14
  • 18
26
votes
5 answers

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL. Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of…
Niels Basjes
  • 10,424
  • 9
  • 50
  • 66
26
votes
3 answers

Large Block Size in HDFS! How is the unused space accounted for?

We all know that the block size in HDFS is pretty large (64M or 128M) as compared to the block size in traditional file systems. This is done in order to reduce the percentage of seek time compared to the transfer time (Improvements in transfer rate…
Abhishek Jain
  • 4,478
  • 8
  • 34
  • 51
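
The seek-versus-transfer trade-off behind large HDFS blocks can be made concrete with a little arithmetic. The 10 ms seek and 100 MB/s transfer figures below are assumed for illustration, not taken from the question.

public class BlockSeekOverhead {
    public static void main(String[] args) {
        double seekMs = 10.0;            // assumed average disk seek time
        double transferMBPerSec = 100.0; // assumed sustained transfer rate

        for (long blockMB : new long[] {4, 64, 128}) {
            double transferMs = blockMB / transferMBPerSec * 1000.0;
            System.out.printf("%3d MB block: seek = %5.2f%% of transfer time%n",
                    blockMB, 100.0 * seekMs / transferMs);
        }
        // With these numbers: 4 MB -> 25%, 64 MB -> 1.56%, 128 MB -> 0.78%,
        // which is why HDFS favours large blocks for sequential reads.
    }
}

As for the unused space in the title: HDFS does not pad blocks, so a file smaller than a block consumes only its actual size on the datanode's local disk; the block size is a unit of metadata and scheduling, not a preallocation.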
26
votes
2 answers

Hadoop : java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

My program looks like public class TopKRecord extends Configured implements Tool { public static class MapClass extends Mapper { public void map(Text key, Text value, Context context) throws IOException,…
daydreamer
  • 87,243
  • 191
  • 450
  • 722
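
On the ClassCastException above: with the default TextInputFormat, the framework hands the mapper a LongWritable byte offset as the input key, not Text, so declaring the map input key as Text fails at runtime. A sketch of a signature that matches the contract; the mapper name and body are illustrative, since the original TopKRecord code is truncated in the excerpt.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetAwareMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself.
        // Declaring the input key as Text instead is what triggers
        // "LongWritable cannot be cast to Text" at runtime.
        context.write(new Text("line"), value);
    }
}

If the input really is tab-separated key/value text, switching the job to KeyValueTextInputFormat delivers Text keys and makes the original Text-keyed signature legitimate.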
26
votes
4 answers

hadoop: difference between 0 reducer and identity reducer?

I am just trying to confirm my understanding of the difference between 0 reducer and identity reducer. 0 reducer means the reduce step will be skipped and the mapper output will be the final output. Identity reducer means shuffling/sorting will still take…
kee
  • 10,969
  • 24
  • 107
  • 168
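
The distinction drawn in the question above maps onto two one-line job settings. A minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API; the job names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducerModes {
    public static void main(String[] args) throws Exception {
        // Map-only job: no shuffle, no sort; mapper output is written
        // directly to HDFS as the final result.
        Job mapOnly = Job.getInstance(new Configuration(), "map-only");
        mapOnly.setNumReduceTasks(0);

        // Identity reduction: the full shuffle and sort still run, and the
        // output is the mapper output grouped and sorted by key.
        Job identity = Job.getInstance(new Configuration(), "identity-reduce");
        identity.setReducerClass(Reducer.class); // base Reducer passes pairs through
    }
}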
25
votes
4 answers

Hive unable to manually set number of reducers

I have the following hive query: select count(distinct id) as total from mytable; which automatically spawns: 1408 Mappers 1 Reducer I need to manually set the number of reducers and I have tried the following: set mapred.reduce.tasks=50 set…
magicalo
  • 463
  • 2
  • 5
  • 12
25
votes
3 answers

Hadoop namenode : Single point of failure

The Namenode in the Hadoop architecture is a single point of failure. How do people who have large Hadoop clusters cope with this problem? Is there an industry-accepted solution that has worked well wherein a secondary Namenode takes over in case…
rakeshr
  • 1,027
  • 3
  • 17
  • 25
25
votes
5 answers

Gradle Transitive dependency exclusion is not working as expected. (How do I get rid of com.google.guava:guava-jdk5:13.0 ?)

Here is a snippet of my build.gradle: compile 'com.google.api-client:google-api-client:1.19.0' compile 'com.google.apis:google-api-services-oauth2:v2-rev77-1.19.0' compile 'com.google.apis:google-api-services-plus:v1-rev155-1.19.0' compile…
unify
  • 6,161
  • 4
  • 33
  • 34
25
votes
7 answers

How to specify AWS Access Key ID and Secret Access Key as part of an Amazon s3n URL

I am passing input and output folders as parameters to a MapReduce word count program from a webpage. I am getting the error below: HTTP Status 500 - Request processing failed; nested exception is java.lang.IllegalArgumentException: AWS Access Key ID and…
user3795951
  • 321
  • 2
  • 5
  • 7
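
One common alternative to embedding the credentials in the URL is setting them on the job configuration. A sketch using the classic s3n property names, with placeholder values:

import org.apache.hadoop.conf.Configuration;

public class S3nCredentials {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Classic s3n credential properties; never hard-code real keys.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
        // The input path can then be a plain s3n://bucket/path URL with
        // no credentials embedded in it.
    }
}

This also sidesteps a known pitfall of the s3n://ID:SECRET@bucket form, where a secret key containing a slash breaks URL parsing.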
25
votes
4 answers

Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

I am trying to run a map/reduce job in Java. Below are my files: WordCount.java package counter; public class WordCount extends Configured implements Tool { public int run(String[] arg0) throws Exception { Configuration conf = new…
Neil
  • 1,715
  • 6
  • 30
  • 45
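
The type mismatch above usually means the map output key class declared on the job disagrees with what the mapper actually emits; a frequent trigger is never calling job.setMapperClass, which leaves the default identity mapper forwarding the LongWritable offset as the key. A minimal sketch of consistent wiring, reusing the illustrative TokenizerMapper from the word-count sketch at the top of this page (assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "type wiring");
        // Without this line the default identity mapper runs and emits
        // (LongWritable offset, Text line), clashing with the Text key
        // declared below and producing the "Type mismatch in key" error.
        job.setMapperClass(WordCount.TokenizerMapper.class); // emits (Text, IntWritable)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
    }
}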