Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single commodity server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12,151 questions
3 votes • 2 answers

Hadoop Datanode starts on wrong interface

We use two interfaces for our Hadoop cluster: private (eth-1) and public. It looks like when the Hadoop datanode starts, it picks the public IP address instead of the private one. When I look at hadoop-cmf-hdfs-DATANODE-hostname.log.out, it shows STARTUP_MSG:…
user2562618 • 327 • 6 • 14
3 votes • 2 answers

MongoDB complex select count group by function

I have a collection called 'my_emails' where email addresses are stored: [ { email:"russel@gmail.com"}, { email:"mickey@yahoo.com"}, { email:"john@yahoo.com"}, ] and I am trying to get the top 10 hostnames used... [ {host: "gmail.com",…
sly63 • 305 • 2 • 6
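A sketch of one way to get the top hostnames with the MongoDB Java driver's aggregation pipeline instead of map-reduce (assumes MongoDB 3.4+ for the $split operator; the database name and connection details are hypothetical):

    import static java.util.Arrays.asList;
    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class TopHosts {
        public static void main(String[] args) {
            // Hypothetical connection and database name; the collection name
            // comes from the question.
            MongoCollection<Document> emails = new MongoClient("localhost")
                    .getDatabase("test").getCollection("my_emails");

            emails.aggregate(asList(
                    // host = the part of the address after the '@'
                    new Document("$project", new Document("host",
                            new Document("$arrayElemAt", asList(
                                    new Document("$split", asList("$email", "@")), 1)))),
                    // group identical hosts and count them
                    new Document("$group", new Document("_id", "$host")
                            .append("count", new Document("$sum", 1))),
                    // biggest first, keep the top 10
                    new Document("$sort", new Document("count", -1)),
                    new Document("$limit", 10)
            )).forEach((Document d) -> System.out.println(d.toJson()));
        }
    }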
3 votes • 3 answers

Hadoop - Classic MapReduce Wordcount

In my Reducer code, I am using this code snippet to sum the values: for(IntWritable val : values) { sum += val.get(); } As the above gives me the expected output, I tried changing the code to: for(IntWritable val : values) { …
AJm • 993 • 2 • 20 • 39
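The snippet in the excerpt is the standard summing pattern. The likely pitfall when changing it (an assumption, since the excerpt is truncated) is that Hadoop reuses a single Writable instance while iterating, so holding references to the objects instead of copying their values out gives wrong results. A sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();   // correct: copy the primitive out immediately
            }
            context.write(key, new IntWritable(sum));

            // Broken variant (do NOT do this): Hadoop reuses one IntWritable
            // object for the whole iteration, so every stored reference ends
            // up pointing at whatever value was deserialized last.
            //
            //   List<IntWritable> kept = new ArrayList<>();
            //   for (IntWritable val : values) {
            //       kept.add(val);
            //   }
        }
    }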
3 votes • 1 answer

Retrieve the position in Array in mongodb

Is it possible to retrieve the position of an array element that matches the query? For example, I have a collection with documents like this: {"_id":ObjectId("560122469431950bf55cb095"), "otherIds": [100, 103, 108, 104]} And I would like to…
hiamex • 295 • 2 • 11
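Since MongoDB 3.4 one way to answer this is the $indexOfArray aggregation operator; a minimal sketch with the Java driver (collection, database, and connection names are hypothetical):

    import static java.util.Arrays.asList;
    import com.mongodb.MongoClient;
    import org.bson.Document;

    public class ArrayPosition {
        public static void main(String[] args) {
            // Project the position of the value 103 inside the otherIds array;
            // $indexOfArray returns -1 when the value is absent (MongoDB 3.4+).
            new MongoClient("localhost").getDatabase("test").getCollection("docs")
                .aggregate(asList(
                    new Document("$project", new Document("position",
                        new Document("$indexOfArray", asList("$otherIds", 103))))))
                .forEach((Document d) -> System.out.println(d.toJson()));
        }
    }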
3 votes • 1 answer

How to prevent a Hadoop job from failing when the directory is empty?

I have a job that fails when there are no files in the input directory. The exception I get is the following: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:Input Pattern maprfs:/profile/* I know this exception is coming from the…
danilo • 834 • 9 • 25
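A common guard is to test the input glob from the driver before submitting the job; a sketch using the Hadoop FileSystem API (the path is taken from the question's exception, the job name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class GuardedDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // globStatus returns an empty array (or null) when the pattern
            // matches nothing, which is exactly the case that breaks the job.
            Path pattern = new Path("/profile/*");
            FileStatus[] matches = fs.globStatus(pattern);
            if (matches == null || matches.length == 0) {
                System.out.println("No input files; skipping the job run.");
                return;
            }

            Job job = Job.getInstance(conf, "profile-job");
            FileInputFormat.setInputPaths(job, pattern);
            // ... set jar, mapper, reducer, and output path as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }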
3 votes • 1 answer

When should we go for Apache Spark?

Would it be wise to replace MR completely with Spark? Here are the areas where we still use MR and need your input before going ahead with the Apache Spark option: ETL: data validation and transformation. Sqoop and custom MR programs using the MR API. Machine…
akshat thakar • 1,445 • 21 • 29
3 votes • 1 answer

yarn stderr no logger appender and no stdout

I'm running a simple MapReduce wordcount program against Apache Hadoop 2.6.0. Hadoop is running in distributed mode (several nodes). However, I'm not able to see any stderr and stdout from the YARN job history. (But I can see the syslog.) The wordcount…
3 votes • 1 answer

Error handling in hadoop map reduce

Based on the documentation, there are a few ways error handling can be performed in MapReduce. Below are a few: a. Custom counters using an enum - increment for every failed record. b. Log the error and analyze later. Counters give the number of…
Ramzy • 6,948 • 6 • 18 • 30
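Option (a), counters keyed by an enum, looks roughly like this; a minimal sketch with a hypothetical record format (comma-separated, at least two fields):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParsingMapper extends Mapper<LongWritable, Text, Text, Text> {
        // One counter per failure category; Hadoop aggregates these across
        // all tasks and reports the totals with the job status.
        public enum ParseErrors { MALFORMED_RECORD }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 2) {
                // Count the bad record and move on instead of failing the task.
                context.getCounter(ParseErrors.MALFORMED_RECORD).increment(1);
                return;
            }
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }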
3 votes • 1 answer

Trying to use LZO Compression with MapReduce

I want to use LZO compression in MapReduce, but am getting an error when I run my MapReduce job. I am using Ubuntu with a Java program. I am only trying to run this on my local machine. My initial error is ERROR lzo.GPLNativeCodeLoader: Could not…
Matt Cremeens • 4,951 • 7 • 38 • 67
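For reference, the usual hadoop-lzo wiring looks roughly like the sketch below. The property names come from the hadoop-lzo project; the library path is a placeholder, and the native-library requirement is the part the question's GPLNativeCodeLoader error points at:

    import org.apache.hadoop.conf.Configuration;

    public class LzoJobConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Register the codec classes shipped by the hadoop-lzo project
            // (assumes the hadoop-lzo jar is on the classpath).
            conf.set("io.compression.codecs",
                    "org.apache.hadoop.io.compress.DefaultCodec,"
                  + "com.hadoop.compression.lzo.LzoCodec,"
                  + "com.hadoop.compression.lzo.LzopCodec");
            conf.set("io.compression.codec.lzo.class",
                    "com.hadoop.compression.lzo.LzoCodec");
            // The GPLNativeCodeLoader error itself is about the *native* half
            // of the codec: libgplcompression must be visible to the JVM, e.g.
            //   -Djava.library.path=/path/to/hadoop-lzo/lib/native
            // Without it, registering the codec classes above is not enough.
        }
    }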
3 votes • 5 answers

Who will get a chance to execute first, Combiner or Partitioner?

I'm getting confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.…
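The passage's answer, in short: the partition is decided first, as each map output record lands in the map-side buffer, and the combiner runs afterwards on each sorted partition before it is spilled to disk. A job-wiring sketch (reusing the WordCountReducer class from the word-count sketch near the top of this page):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class OrderingExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            // 1. Each map output record is assigned a partition number as it
            //    is written into the map task's in-memory buffer...
            job.setPartitionerClass(HashPartitioner.class);
            // 2. ...then, within each partition, records are sorted and the
            //    combiner (if one is set) runs on the sorted data before each
            //    spill. So the partitioner executes first, the combiner after.
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
        }
    }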
3 votes • 0 answers

Getting Chunk of data in Node server from mongoDB server

Hello all, I have a collection in MongoDB whose size is 30K. When I run a find query (I am using Mongoose) from the Node server, the following problems occur. 1: It takes a long time to get the result back from the database server. 2: While creating the JSON object…
Vishu238 • 673 • 4 • 17
3 votes • 1 answer

Can I use Hadoop in Jupyter/IPython?

Can I use Hadoop & MapReduce in Jupyter/IPython? Is there something similar to what PySpark is for Spark?
Fisseha Berhane • 2,533 • 4 • 30 • 48
3 votes • 1 answer

Are there any use cases where hadoop map-reduce can do better than apache spark?

I agree that iterative and interactive programming paradigms are served much better by Spark than by map-reduce. And I also agree that we can use HDFS or any Hadoop data store like HBase as a storage layer for Spark. Therefore, my question is: do we have any…
Jagadish Talluri • 688 • 5 • 13
3 votes • 1 answer

MongoDB C# driver 2.0: How to get the result from MapReduceAsync

I'm using MongoDB version 3 with C# driver 2.0 and would like to get the result of the MapReduceAsync method. I have this collection "users": { "_id" : 1, "firstName" : "Rich", "age" : "18" } { "_id"…
3 votes • 1 answer

Pyspark reduceByKey with (key, Dictionary) tuple

I'm a bit stuck trying to do a map-reduce on Databricks with Spark. I want to process log files and reduce to a (key, dict()) tuple. However, I'm always getting an error. I'm not a hundred percent sure that's the right way to do it. I'd…
cdudek • 123 • 2 • 10
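The question is about PySpark; to keep one language across this page, here is the same (key, dict) pattern in Spark's Java API, with a Map playing the role of the dict: emit a one-entry map per record, then merge maps associatively in reduceByKey. The log format and file name are assumptions:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class LogDictReduce {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("log-dict-reduce").setMaster("local[*]"));

            // Hypothetical log format: "<host> <level> <message>"
            JavaPairRDD<String, Map<String, Integer>> perLine = sc
                    .textFile("logs.txt")
                    .mapToPair(line -> {
                        String[] parts = line.split(" ", 3);
                        Map<String, Integer> levelCounts = new HashMap<>();
                        levelCounts.put(parts[1], 1);  // one occurrence of this level
                        return new Tuple2<>(parts[0], levelCounts);
                    });

            // The merge function must be associative and must not mutate its
            // inputs, so build a fresh map that sums the per-level counts.
            JavaPairRDD<String, Map<String, Integer>> perHost = perLine
                    .reduceByKey((a, b) -> {
                        Map<String, Integer> merged = new HashMap<>(a);
                        b.forEach((level, n) -> merged.merge(level, n, Integer::sum));
                        return merged;
                    });

            perHost.collect().forEach(t ->
                    System.out.println(t._1() + " -> " + t._2()));
            sc.stop();
        }
    }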