Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets, applicable to certain kinds of distributable problems and designed to run across a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that every output of the map operation that shares the same key is presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server could handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
3
votes
0 answers

Hadoop mapreduce defining separators for streaming

I'm using Hadoop 2.7.1. I'm really struggling to understand at what point in the streaming process sorts are applied, how you can change the sort order, and how to change the separator. Reading the documentation has confused me further since some config variables…
James Owers
  • 7,948
  • 10
  • 55
  • 71
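In short, the sort happens during the shuffle, after map output is partitioned and before each reducer sees its keys. For streaming, the key/value separator of the map output is controlled by stream.map.output.field.separator and stream.num.map.output.key.fields, and the sort order by KeyFieldBasedComparator, all normally passed as -D options to the streaming jar. The sketch below shows the same comparator, partitioner and Hadoop 2.x property names configured from a Java driver; the field separator and the -k options (Unix sort syntax) are example values, not tied to the asker's job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator;
import org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedPartitioner;

public class SortConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // These keys control how map output keys are split into fields and how
    // the keys are compared during the shuffle sort.
    conf.set("mapreduce.map.output.key.field.separator", ",");
    conf.set("mapreduce.partition.keypartitioner.options", "-k1,1");  // partition on field 1
    conf.set("mapreduce.partition.keycomparator.options", "-k2,2n");  // sort numerically on field 2

    Job job = Job.getInstance(conf, "key-field sort sketch");
    job.setSortComparatorClass(KeyFieldBasedComparator.class);
    job.setPartitionerClass(KeyFieldBasedPartitioner.class);
    // ... mapper/reducer/input/output setup omitted ...
  }
}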
3
votes
1 answer

Mongodb grouping data - mapReduce or aggregation?

I have documents like this: { "_id" : ObjectId("565e906bc2209d91c4357b59"), "userEmail" : "abc@example.com", "subscription" : { "project1" : { "subscribed" : false }, "project2" : { …
Vimalraj Selvam
  • 2,155
  • 3
  • 23
  • 52
3
votes
3 answers

How to keep a state in Hadoop jobs?

I'm working on a Hadoop program which is scheduled to run once a day. It takes a bunch of JSON documents, and each document has a timestamp which shows when the document was added. My program should only process those documents that were added…
HHH
  • 6,085
  • 20
  • 92
  • 164
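Without the full question it is hard to be specific, but one common pattern for this kind of incremental daily job is to persist the timestamp of the last successful run (for example in a small file on HDFS), pass it to the job through the Configuration, and filter old records in the mapper. A rough sketch; the property name "lastRunMillis" and the timestamp parsing are placeholders.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: drops JSON documents whose timestamp is older than the
// cutoff passed in by the driver ("lastRunMillis" is an illustrative key).
public class NewDocumentsMapper extends Mapper<LongWritable, Text, Text, Text> {
  private long lastRunMillis;

  @Override
  protected void setup(Context context) {
    lastRunMillis = context.getConfiguration().getLong("lastRunMillis", 0L);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    long docMillis = extractTimestamp(value.toString());
    if (docMillis > lastRunMillis) {
      context.write(new Text(Long.toString(docMillis)), value);
    }
  }

  private long extractTimestamp(String json) {
    // Placeholder: real code would use a JSON parser to read the timestamp field.
    return 0L;
  }
}

In the driver you would set the cutoff before submitting, e.g. job.getConfiguration().setLong("lastRunMillis", lastRun), and write the new timestamp back out only after the job succeeds.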
3
votes
1 answer

CouchDB returns "wrong" total_rows

I have 5680 documents in my CouchDB. I reduced it with something like: function(doc) { if (doc.address.country && doc.cats) { for (i = 0; i < doc.cats.length; i++) { emit([doc.address.country, doc.cats[i].id], doc); } …
Christian
  • 7,062
  • 5
  • 36
  • 39
3
votes
2 answers

Why can't we use Java primitive data types in Map Reduce?

I am learning the Hadoop MapReduce framework. I am struggling to understand why we can't use Java primitive data types in MapReduce.
rraghuva
  • 131
  • 1
  • 10
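Two constraints are at work here: Java generics only accept reference types, so something like Mapper<LongWritable, Text, Text, int> would not even compile, and Hadoop additionally requires keys and values to implement Writable (keys WritableComparable) so it can serialize them between nodes and compare keys during the shuffle sort. That is why the framework ships box types such as IntWritable, LongWritable and Text. A minimal sketch with an illustrative mapper:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Key/value types must be Writables so Hadoop can serialize them between
// nodes and sort/compare keys during the shuffle. A primitive like int is
// neither a class nor a Writable, so it cannot appear in these type slots.
public class LengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Wrap the primitive result in an IntWritable before emitting it.
    context.write(line, new IntWritable(line.getLength()));
  }
}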
3
votes
4 answers

Which is better for log analysis

I have to analyze Gzip-compressed log files which are stored on a production server, using Hadoop-related tools. I can't decide how to do that or what to use; here are some of the methods I thought about using (feel free to recommend something…
Yaswanth
  • 483
  • 1
  • 9
  • 25
3
votes
1 answer

"Unable to execute HTTP Request: Broken Pipe" with Hadoop / s3 on Amazon EMR

I've developed a custom JAR that I'm using to process data in Elastic MapReduce. The data is several hundred thousand files coming from Amazon S3. The JAR doesn't do anything terribly funky to read data - it's just using…
John Chrysostom
  • 3,973
  • 1
  • 34
  • 50
3
votes
4 answers

No Namenode or Datanode or Secondary NameNode to stop

I installed Hadoop on Ubuntu 12.04 by following the procedure in the link below. http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php Everything installed successfully, and when I run start-all.sh only…
Wanderer
  • 447
  • 3
  • 11
  • 20
3
votes
0 answers

How to get rid of the suffix "-r-00xxx" when using Hadoop MultipleOutputs?

In the MR job: FileOutputFormat.setCompressOutput(job, true); FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); MultipleOutputs.addNamedOutput(job, OUTPUT, TextOutputFormat.class, NullWritable.class, Text.class); In my Reducer: String…
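The -r-00xxx (or -m-00xxx) suffix is the framework's part-file numbering, so MultipleOutputs alone does not remove it even when you pass a baseOutputPath. One common workaround is to rename the files from the driver after waitForCompletion() returns; a sketch (the glob and regex assume reducer-side, Gzip-compressed outputs, and renaming will collide if two files clean to the same name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical post-processing step run in the driver after the job finishes:
// rename "name-r-00000.gz" style files to drop the "-r-00xxx" part suffix.
public class StripPartSuffix {
  public static void main(String[] args) throws Exception {
    Path outputDir = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.globStatus(new Path(outputDir, "*-r-*"))) {
      String name = status.getPath().getName();
      String cleaned = name.replaceAll("-r-\\d{5}", "");
      fs.rename(status.getPath(), new Path(outputDir, cleaned));
    }
  }
}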
3
votes
2 answers

How to limit the number of reduce jobs in mapreduce java code in hadoop

I'm new to Hadoop and I want to limit the number of reduce jobs in my application. In the cluster, the maximum number of reduce jobs is 120, but I don't want to use all of them, because my application doesn't need that many. I…
seha
  • 51
  • 1
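The number of reduce tasks (what the question calls reduce jobs) is a per-job setting, so you can simply ask for fewer than the cluster maximum, either in the driver with job.setNumReduceTasks(...) or on the command line with -D mapreduce.job.reduces=N. A minimal driver fragment; the job name and the count of 10 are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Driver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "my job");
    // Ask for 10 reduce tasks instead of the cluster-wide maximum.
    job.setNumReduceTasks(10);
    // ... remaining job setup ...
  }
}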
3
votes
2 answers

Why do we need the "map" part in MapReduce?

The programming model MapReduce consists of two procedures, map and reduce. Why do we need the map part, when we can simply do the mapping inside the reduce function? Consider the following pseudocode: result =…
3
votes
0 answers

How exactly is Impala faster than Hive?

There are multiple tools built to access data from Hadoop. Very popular amongst them are Hive and Impala. While Impala was built to address the batch nature of Hive (for low-cost SQL queries), Impala cannot eliminate MapReduce completely, as it's really great a…
funsuk
  • 71
  • 2
  • 6
3
votes
1 answer

How do you handle different value types in reducer

I am writing a MapReduce program which has 2 mappers and 1 reducer, and I implemented custom Writable datatypes for each mapper. The datatype is more or less just a container whose fields are Text / IntWritable values. So Mapper 1 outputs id(Text),…
VSEWHGHP
  • 195
  • 2
  • 3
  • 12
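One standard approach when two mappers feed one reducer with different value classes is to wrap both custom Writables in a subclass of org.apache.hadoop.io.GenericWritable, so the job has a single value type; inside the reducer, get() plus an instanceof check tells the payloads apart. A sketch using Text and IntWritable as stand-ins for the asker's two custom Writable types:

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Wrapper value type: both mappers emit MultiValueWritable, so the reducer has
// one value class. Text and IntWritable stand in for the two custom Writables.
public class MultiValueWritable extends GenericWritable {

  @SuppressWarnings("unchecked")
  private static final Class<? extends Writable>[] TYPES =
      new Class[] { Text.class, IntWritable.class };

  public MultiValueWritable() {}

  public MultiValueWritable(Writable value) {
    set(value);
  }

  @Override
  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}

In the reducer, each value's wrapped payload is recovered with wrapper.get() and distinguished with instanceof before being cast back to the original type.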
3
votes
1 answer

Hadoop - Properly sort by key and group by reducer

I have some data coming out of the reducer which looks like this: 9,2 3 5,7 2 2,3 0 1,5 3 6,3 0 4,2 2 7,1 1 And I would like to sort it according to the number in the second column, like this: 2,3 0 6,3 0 7,1 1 5,7…
Robin Dupont
  • 339
  • 1
  • 2
  • 12
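Plain MapReduce sorts by key during the shuffle, not by the reducer's output value, so output like this will not come back ordered by the second column on its own. A common fix is a small second job whose mapper swaps the pair so the number becomes the key; with a single reducer (or a total-order partitioner) the final output is then sorted by that number. A sketch, assuming the first job's output lines are tab-separated:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical second-pass mapper: reads lines like "9,2<TAB>3" produced by the
// first job and emits (3, "9,2") so the shuffle sorts records by that number.
public class SwapForSortMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t");
    if (parts.length == 2) {
      context.write(new IntWritable(Integer.parseInt(parts[1].trim())), new Text(parts[0]));
    }
  }
}

A simple reducer can then write value followed by key to restore the original "pair number" layout while keeping the sorted order.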
3
votes
1 answer

java.lang.RuntimeException: java.lang.NoSuchMethodException: Hadoop mapreduce

I am getting java.lang.NoSuchMethodException; please help me with this ... import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException; import java.text.DateFormat; import java.text.SimpleDateFormat; import…
Barath
  • 107
  • 2
  • 14
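Without the full stack trace this is only a guess, but the most common cause of java.lang.NoSuchMethodException for a Mapper or Reducer is declaring it as a non-static inner class: Hadoop instantiates these classes by reflection through a no-argument constructor, which a non-static inner class does not effectively have. Declaring the nested class static (or moving it to its own top-level file) usually fixes it; a sketch with illustrative names:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Driver {

  // Declared static: a non-static inner Mapper cannot be created through a
  // plain no-arg constructor, which is a common cause of NoSuchMethodException
  // when Hadoop tries to instantiate it by reflection.
  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text("key"), value);
    }
  }
}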