Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets, applicable to certain kinds of distributable problems, using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

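To make the two steps concrete, here is a minimal sketch of the canonical word-count job against the Hadoop Java API (class names and paths are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // "Map" step: each mapper sees a slice of the input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // "Reduce" step: all counts for the same word arrive at one reducer, which sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper emits (word, 1) pairs in parallel; the framework then groups all pairs with the same word and routes them to one reducer, which sums the counts.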
12151 questions
3
votes
1 answer

Unable to generate jar file for Hadoop

I have 16 Java files and I am trying to generate a JAR file for the Hadoop ecosystem using the command below: javac -classpath /usr/local/hadoop/hadoop-core-1.0.3.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar JsonV.java. JsonV.java is the class…
Sudhir Belagali
  • 370
  • 1
  • 7
  • 22
3
votes
1 answer

Apache Spark and non-serializable application context

I'm new to Spark. I want to parallelize my computations using Spark and a map-reduce approach. But these computations, which I put into a PairFunction implementation for the map stage, require some context to be initialized. This context includes several…
pikkvile
  • 2,531
  • 2
  • 17
  • 16
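For problems like the one above, a common pattern is to avoid serializing the context at all: declare it transient inside the function object and build it lazily on the executor. A minimal sketch, with Analyzer as a hypothetical stand-in for the non-serializable application context:

```java
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class CategorizeFunction implements PairFunction<String, String, Integer> {

    // Hypothetical stand-in for the heavy, non-serializable application context.
    public static class Analyzer {
        String categorize(String line) {
            return line.isEmpty() ? "empty" : "text";
        }
    }

    // transient: excluded when Spark serializes the closure; rebuilt per executor.
    private transient Analyzer analyzer;

    private Analyzer analyzer() {
        if (analyzer == null) {
            analyzer = new Analyzer();  // initialized where the task actually runs
        }
        return analyzer;
    }

    @Override
    public Tuple2<String, Integer> call(String line) {
        return new Tuple2<>(analyzer().categorize(line), 1);
    }
}
```

Broadcast variables are the other standard answer when the context is large but serializable.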
3
votes
1 answer

Memory-efficient way to transform a collection in mongodb

I have a collection like this in mongodb: { "_id" : ObjectId("56a5f47ed420cf0db5b70242"), "tag" : "swift", "values" : [ { "word" : "osx", "value" : 0.02 …
Axazeano
  • 890
  • 9
  • 23
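For this kind of reshaping, a memory-friendly option is an aggregation pipeline ending in $out, so the work happens server-side and the client never materializes the full result. A sketch with the MongoDB Java driver; the pipeline stages and target collection name are assumptions based on the truncated document shape:

```java
import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class TransformTags {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> tags =
                    client.getDatabase("test").getCollection("tags");

            // Flatten the embedded "values" array into one document per
            // (tag, word, value) and write the result to a new collection
            // entirely on the server side.
            tags.aggregate(Arrays.asList(
                    new Document("$unwind", "$values"),
                    new Document("$project", new Document("tag", "$tag")
                            .append("word", "$values.word")
                            .append("value", "$values.value")),
                    new Document("$out", "tags_flat")
            )).toCollection();  // forces execution; results land in "tags_flat"
        }
    }
}
```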
3
votes
1 answer

Deep Learning: is there any open-source library that can be integrated with Hadoop streaming and MapReduce?

A Google search turned up quite a few open-source deep learning frameworks. Here is a collected list: Google…
Osiris
  • 1,007
  • 4
  • 17
  • 30
3
votes
2 answers

What exactly is SparkSQL?

I am very new to this whole world of "big data" tech, and recently started reading about Spark. One thing that keeps coming up is SparkSQL, yet I consistently fail to comprehend what exactly it is. Is it supposed to convert SQL queries to MapReduce…
user2535982
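Short answer to the question above: Spark SQL does not translate SQL into Hadoop MapReduce. Queries are parsed and optimized by Spark's Catalyst optimizer and executed as ordinary Spark jobs. A minimal sketch with the Java API (the input file is hypothetical):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-demo")
                .master("local[*]")
                .getOrCreate();

        // Register a DataFrame as a temporary view, then query it with SQL.
        Dataset<Row> people = spark.read().json("people.json");  // hypothetical input
        people.createOrReplaceTempView("people");

        // Parsed and optimized by Catalyst, executed as ordinary Spark tasks,
        // not Hadoop MapReduce.
        Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
    }
}
```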
3
votes
1 answer

performing a priority query in mongo

Sample document: {"name":"John", "age":35, "address":".....",.....} Employees whose join_month=3 are priority 1; employees whose address contains the string "Avenue" are priority 2; employees whose address contains the string "Street" are priority 3…
Thomas
  • 191
  • 2
  • 12
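One way to express such tiered matching is to compute a priority field in an aggregation pipeline and sort on it. The sketch below uses the MongoDB Java driver and assumes MongoDB 3.4+ (for $switch, $indexOfCP, and $addFields); field names follow the excerpt:

```java
import java.util.Arrays;
import java.util.List;

import org.bson.Document;

public class PriorityPipeline {

    // An aggregation-level "contains" test: {$gte: [{$indexOfCP: [field, substr]}, 0]}
    static Document contains(String field, String substr) {
        return new Document("$gte", Arrays.asList(
                new Document("$indexOfCP", Arrays.asList(field, substr)), 0));
    }

    // Tag each employee with a computed priority, then sort by it.
    static List<Document> pipeline() {
        Document priority = new Document("$switch", new Document("branches", Arrays.asList(
                new Document("case", new Document("$eq", Arrays.asList("$join_month", 3)))
                        .append("then", 1),
                new Document("case", contains("$address", "Avenue")).append("then", 2),
                new Document("case", contains("$address", "Street")).append("then", 3)
        )).append("default", 4));

        return Arrays.asList(
                new Document("$addFields", new Document("priority", priority)),
                new Document("$sort", new Document("priority", 1)));
    }

    public static void main(String[] args) {
        pipeline().forEach(stage -> System.out.println(stage.toJson()));
    }
}
```

Passing pipeline() to collection.aggregate(...) then returns employees ordered by their priority tier.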
3
votes
1 answer

how to iterate an object in mongodb documents?

The documents in my mongo collection look like this: { "_id" : ObjectId("568f7e67676b4ddf133999e8"), "auth_dic" : { "2406" : [ "44735" ], "6410" : [ "223423" ] ... ... …
ray.li
  • 31
  • 2
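Because the keys of auth_dic are dynamic, one option in keeping with this tag is MongoDB's mapReduce command, whose JavaScript map function can use for..in over the embedded object. A sketch via the Java driver (database and collection names are assumptions; the mapReduce helper is available in driver versions that still expose it):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class IterateAuthDic {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("users");

            // The JS map function iterates the dynamic keys of auth_dic and
            // emits (key, number of values); reduce sums the counts per key.
            String map = "function() {"
                    + "  for (var k in this.auth_dic) {"
                    + "    emit(k, this.auth_dic[k].length);"
                    + "  }"
                    + "}";
            String reduce = "function(key, values) { return Array.sum(values); }";

            coll.mapReduce(map, reduce)
                .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```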
3
votes
2 answers

Why is hadoop mapReduce with python failing when the scripts work on the command line?

I'm trying to implement a simple Hadoop map reduce example using Cloudera 5.5.0. The map & reduce steps should be implemented using Python 2.6.6. Problem: if the scripts are executed on the unix command line they work perfectly fine and…
Marco P.
  • 81
  • 5
3
votes
1 answer

MRUnit test giving NullPointerException while writing to HDFS using MultipleOutputs

I currently have a mapReduce program that sends data to HDFS with different file names. So in my reducer I am using MultipleOutputs to write to different files in HDFS (full reducer code below). I would like to test my code using MRUnit, and below is my…
himaja
  • 33
  • 4
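The NullPointerException usually occurs because MRUnit never initializes a real MultipleOutputs. One common workaround, sketched below with Mockito, is to have the reducer obtain its MultipleOutputs through an overridable factory method and substitute a mock in the test; MyReducer and createMultipleOutputs are hypothetical names for this pattern:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;
import org.mockito.Mockito;

public class MyReducerTest {

    // Minimal stand-in for the reducer under test: it obtains MultipleOutputs
    // through an overridable factory method instead of constructing it directly.
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        protected MultipleOutputs<Text, Text> createMultipleOutputs(Context ctx) {
            return new MultipleOutputs<>(ctx);
        }

        @Override
        protected void setup(Context ctx) {
            mos = createMultipleOutputs(ctx);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text value : values) {
                mos.write(key, value, key.toString());  // baseOutputPath per key
            }
        }
    }

    @Test
    public void writesToNamedOutput() throws Exception {
        @SuppressWarnings("unchecked")
        final MultipleOutputs<Text, Text> mos = Mockito.mock(MultipleOutputs.class);

        MyReducer reducer = new MyReducer() {
            @Override
            protected MultipleOutputs<Text, Text> createMultipleOutputs(Context ctx) {
                return mos;  // bypass the HDFS-backed instance MRUnit cannot build
            }
        };

        ReduceDriver.newReduceDriver(reducer)
                .withInput(new Text("k"), Arrays.asList(new Text("v")))
                .run();

        Mockito.verify(mos).write(new Text("k"), new Text("v"), "k");
    }
}
```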
3
votes
1 answer

mapReduce using node.js and mongoose

I am trying to get a count of the number of students per locality. I have a model that looks like: var mongoose = require('mongoose'); var schema = mongoose.Schema; var studentSchema = new mongoose.Schema( { "name":String, "address" :{ …
Rahul Ganguly
  • 1,908
  • 5
  • 24
  • 36
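A per-group count is often simpler with the aggregation framework than with mapReduce. Purely for illustration, here is the equivalent $group pipeline written with the MongoDB Java driver; the address.locality field path is assumed from the truncated schema:

```java
import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class StudentsPerLocality {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> students =
                    client.getDatabase("school").getCollection("students");

            // Group by the embedded locality field and count students in each group.
            students.aggregate(Arrays.asList(
                    new Document("$group", new Document("_id", "$address.locality")
                            .append("count", new Document("$sum", 1)))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```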
3
votes
1 answer

How do I use Elastic MapReduce to run an XSLT transformation on millions of small S3 xml files?

More specifically, is there a somewhat easy streaming solution?
zack
  • 31
  • 1
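One streaming-flavoured sketch: give the job a plain text file listing one S3 object path per line, and let a map-only task fetch each object and apply the stylesheet with javax.xml.transform. Everything here (input layout, stylesheet name) is an assumption:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: input records are S3 paths, output is (path, transformed XML).
public class XsltMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Transformer transformer;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // transform.xslt is assumed to be shipped with the job
            // (e.g. via the distributed cache).
            transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("transform.xslt"));
        } catch (TransformerConfigurationException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text s3Path, Context context)
            throws IOException, InterruptedException {
        Path path = new Path(s3Path.toString());  // e.g. s3://bucket/key.xml
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        StringWriter out = new StringWriter();
        try (InputStream in = fs.open(path)) {
            transformer.transform(new StreamSource(in), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IOException(e);
        }
        context.write(s3Path, new Text(out.toString()));
    }
}
```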
3
votes
1 answer

Oozie: Launch Map-Reduce from Oozie <java> action?

I am trying to execute a Map-Reduce task in an Oozie workflow using a <java> action. O'Reilly's Apache Oozie (Islam and Srinivasan 2015) notes that: While it's not recommended, Java action can be used to run Hadoop MapReduce jobs because MapReduce…
Suriname0
  • 527
  • 1
  • 8
  • 21
3
votes
1 answer

HDFS text file to Parquet format using a map reduce job

I am trying to convert an HDFS text file to Parquet format using map reduce in Java. Honestly, I am a beginner at this and am unable to find any direct references. Should the conversion be textfile --> avro --> parquet…?
Pradeep
  • 850
  • 2
  • 14
  • 27
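A materialized Avro intermediate is not required: a map-only job can parse each text line into an in-memory Avro GenericRecord and write it straight to Parquet via AvroParquetOutputFormat from the parquet-avro module. A sketch, where the tab-separated two-field schema is an assumption:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class TextToParquet {

    // Hypothetical two-column schema; adapt to the real record layout.
    static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Line\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"payload\",\"type\":\"string\"}]}");

    public static class LineMapper extends Mapper<LongWritable, Text, Void, GenericRecord> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);  // assumed tab-separated
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("id", parts[0]);
            record.put("payload", parts.length > 1 ? parts[1] : "");
            context.write(null, record);  // Parquet output ignores the key
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-to-parquet");
        job.setJarByClass(TextToParquet.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);  // map-only: a format change needs no reduce phase
        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        AvroParquetOutputFormat.setSchema(job, SCHEMA);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```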
3
votes
2 answers

How to implement references in map-reduce databases?

I am starting to study map-reduce databases. How can one implement a reference in a map-reduce database, such as CouchDB or MongoDB? For example, suppose that I have drivers and cars, and I want to mark that some driver drives a car. In SQL it's…
Little Bobby Tables
  • 5,261
  • 2
  • 39
  • 49
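In document stores the usual technique is a manual reference: store the referenced document's _id in the referencing document and resolve it with a second query, since classic CouchDB/MongoDB map-reduce views have no server-side join. A small MongoDB Java driver sketch (names are illustrative):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.bson.types.ObjectId;

public class ManualReference {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("garage");

            // Insert a car, then reference it from a driver by _id.
            ObjectId carId = new ObjectId();
            db.getCollection("cars")
              .insertOne(new Document("_id", carId).append("model", "Corolla"));
            db.getCollection("drivers")
              .insertOne(new Document("name", "Bob").append("car_id", carId));

            // "Join" by hand: look up the driver, then fetch the referenced car.
            Document driver = db.getCollection("drivers")
                    .find(new Document("name", "Bob")).first();
            Document car = db.getCollection("cars")
                    .find(new Document("_id", driver.get("car_id"))).first();
            System.out.println(car.toJson());
        }
    }
}
```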
3
votes
1 answer

How to improve the speed of Solr index building in MapReduce

I wrote a mapreduce job to generate a solr index for my data. I did the generation in the reducer, but the speed is really slow. Is there any way to improve the speed? The code listed below is the code inside the reducer. Is there anything wrong in my…
Cheng Chen
  • 241
  • 3
  • 17
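Without the actual reducer code, the usual suspect is adding and committing one document at a time. A hedged SolrJ sketch of batching adds inside a reducer and committing once in cleanup(); the class, field names, and Solr URL are assumptions:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexingReducer extends Reducer<Text, Text, Text, Text> {

    private SolrClient solr;
    private final List<SolrInputDocument> buffer = new ArrayList<>();

    @Override
    protected void setup(Context context) {
        solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/mycore").build();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        for (Text value : values) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", key.toString() + "-" + i++);  // hypothetical id scheme
            doc.addField("body", value.toString());
            buffer.add(doc);
        }
        // Send documents in batches instead of one add() per record.
        if (buffer.size() >= 1000) {
            flush();
        }
    }

    private void flush() throws IOException {
        try {
            solr.add(buffer);
            buffer.clear();
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            flush();
            solr.commit();  // commit once at the end, not per document
            solr.close();
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }
}
```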