Questions tagged [mapreduce]

MapReduce is an algorithm for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the other, all maps can be performed in parallel - though in practice it is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled — assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions

votes

2 answers

Hadoop Datanode starts on wrong interface

We used 2 interfaces for our hadoop cluster. Private eth-1 and public. It looks like when hadoop datanode starts, it picks public IP address instead of private. When I look at hadoop-cmf-hdfs-DATANODE-hostname.log.out, it shows up STARTUP_MSG:…

asked Oct 20 '15 at 14:35

user2562618

votes

2 answers

MongoDB complex select count group by function

I have a collection called 'my_emails' where are stored email addresses : [ { email:"russel@gmail.com"}, { email:"mickey@yahoo.com"}, { email:"john@yahoo.com"}, ] and I try to get the top 10 hostnames used... [ {host: "gmail.com",…

mysql mongodb mapreduce mongodb-query

asked Oct 07 '15 at 21:06

sly63

votes

3 answers

Hadoop - Classic MapReduce Wordcount

In my Reducer code, I am using this code snippet to sum the values: for(IntWritable val : values) { sum += val.get(); } As the above mentioned gives me expected output, I tried changing the code to: for(IntWritable val : values) { …

hadoop mapreduce

asked Sep 27 '15 at 16:46

AJm

votes

1 answer

Retrieve the position in Array in mongodb

Is it possible to retrieve the position of an array element that matches the query ? For example, I have a collection with documents like this: {"_id":ObjectId("560122469431950bf55cb095"), "otherIds": [100, 103, 108, 104]} And I would like to…

mongodb mapreduce mongodb-query aggregation-framework

asked Sep 24 '15 at 08:58

hiamex

votes

1 answer

How to prevent a Hadoop job to fail when directory is empty?

I have a job that fails when there is no files in the input directory. The exception i get is the following: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:Input Pattern maprfs:/profile/* I know this exception is coming from the…

hadoop mapreduce mapr

asked Sep 22 '15 at 15:34

danilo

votes

1 answer

When should we go for Apache Spark

Would it be wise to replace MR completely with Spark. Here are the areas where we still use MR and need your input to go ahead with Apache Spark option- ETL : Data validation and transformation. Sqoop and custom MR programs using MR API. Machine…

mapreduce apache-spark

asked Sep 16 '15 at 06:59

akshat thakar

1,445
21
29

votes

1 answer

yarn stderr no logger appender and no stdout

I'm running a simple mapreduce program wordcount agian Apache Hadoop 2.6.0. The hadoop is running distributedly (several nodes). However, I'm not able to see any stderr and stdout from yarn job history. (but I can see the syslog) The wordcount…

hadoop mapreduce cloudera hadoop-yarn hortonworks-data-platform

asked Sep 08 '15 at 23:02

pythonician_plus_plus

1,244
3
15
38

votes

1 answer

Error handling in hadoop map reduce

Based on the documentation, there are a few ways, how the error handling is performed in map reduce. Below are the few: a. Custom counters using enum - increment for every failed record. b. Log error and analyze later. Counters give the number of…

hadoop error-handling mapreduce distributed-system

asked Aug 27 '15 at 16:48

Ramzy

6,948
6
18
30

votes

1 answer

Trying to use LZO Compression with MapReduce

I want to use LZO compression in MapReduce, but am getting an error when I run my MapReduce job. I am using Ubuntu with a Java program. I am only trying to run this on my local machine. My initial error is ERROR lzo.GPLNativeCodeLoader: Could not…

java hadoop mapreduce compression hadoop-lzo

asked Aug 20 '15 at 16:00

Matt Cremeens

4,951
7
38
67

votes

5 answers

Who will get a chance to execute first , Combiner or Partitioner?

I'm getting confused after reading below article on Hadoop- Definitive guide 4th edition(page-204) Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.…

hadoop mapreduce hadoop-streaming hadoop-partitioning combiners

asked Aug 20 '15 at 06:26

Prashant

votes

0 answers

Getting Chunk of data in Node server from mongoDB server

Hello All I have a collection in mongoDB whoose size is 30K. When I run the find Query (I am using mongoose) from Node server, following problems occur. 1: It takes long time to get result back from datatabase server 2: While creating JSON object…

javascript node.js mongoose mapreduce mean-stack

asked Aug 17 '15 at 08:19

Vishu238

votes

1 answer

Can I use hadoop in Jupyter/IPython

Can I use Hadoop & MapReduce in Jupyter/IPython? Is there something similar to what PySpark for Spark is?

hadoop mapreduce ipython jupyter

asked Aug 12 '15 at 22:38

Fisseha Berhane

2,533
4
30
48

votes

1 answer

Are there any use cases where hadoop map-reduce can do better than apache spark?

I agree that iterative and interactive programming paradigms are very good with spark than map-reduce. And I also agree that we can use HDFS or any hadoop data store like HBase as a storage layer for Spark. Therefore, my question is - Do we have any…

apache-spark hadoop mapreduce

asked Aug 03 '15 at 12:00

Jagadish Talluri

votes

1 answer

MongoDB C# driver 2.0: How to get the result from MapReduceAsync

MongoDB C# driver 2.0: How to get the result from MapReduceAsync I'm using MongoDB version 3, C# driver 2.0 and would get the result of MapReduceAsync method. I have this collection "users": { "_id" : 1, "firstName" : "Rich", "age" : "18" } { "_id"…

c# mongodb asynchronous mapreduce mongodb-csharp-2.0

asked Aug 02 '15 at 21:09

Leonardo JAS

votes

1 answer

Pyspark reduceByKey with (key, Dictionary) tuple

I'm stuck a bit in trying to do a map-reduce on databricks with spark. I want to process log files and I want to reduce to a (key, dict()) tuple. However I'm always getting an error. I'm not hundert percent sure if that's the right way to do it. I'd…

python dictionary mapreduce apache-spark pyspark

asked Jul 26 '15 at 22:03

cdudek

Prev 1 2 3

…

100 Next