Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time.

While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server could handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

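The two steps above can be sketched as a single-process Python simulation of a word count, the canonical MapReduce example. This is illustrative only: a real Hadoop job distributes the map, shuffle, and reduce phases across worker nodes, and the function names here are arbitrary.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for each word."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: merge all values that share a key into one result."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(intermediate)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["the"])  # 3
```

Because each `map_phase` call touches only its own document and each `reduce_phase` call touches only one key's values, every call in each phase could run on a different node.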
12151 questions
3
votes
1 answer

In Hadoop, what do under-replication and over-replication mean, and how do they work?

In the MapReduce/HDFS context, what are under-replicated and over-replicated blocks, and how can the two be balanced?
Veer
  • 31
  • 1
  • 3
3
votes
2 answers

Hadoop MapReduce example for string transformation

I have a large number of strings in a text file and need to transform these strings with the following algorithm: convert each string to lowercase and remove all spaces. Can you give me an example of a Hadoop MapReduce function that implements this algorithm? Thank you.
Alex Zhulin
  • 1,239
  • 2
  • 22
  • 42
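The transformation above needs only a mapper, since no values have to be merged by key. A minimal local sketch of that mapper logic in Python (in a real Hadoop job this body would live in the map function, with the number of reducers set to zero):

```python
def map_transform(_key, line):
    """Map-only step: emit the lowercased, space-free line.
    The key is ignored; no reduce phase is required."""
    yield (line.lower().replace(" ", ""), None)

lines = ["Hello World", "Map Reduce EXAMPLE"]
transformed = [k for line in lines for k, _ in map_transform(None, line)]
print(transformed)  # ['helloworld', 'mapreduceexample']
```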
3
votes
1 answer

Get a summary for multiple class fields in Java 8 using the map-reduce technique

Below is the Student class definition: class Student { String name; int sub1; int sub2; int sub3; // etc...etc... } I have a list of all students. The requirement is to get the average of sub1, sub2 and sub3, and also to get the min mark and max…
Robby Goz
  • 57
  • 6
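The aggregation being asked for is a reduction over each field. A local Python sketch of that logic (the `Student` fields mirror the question; in Java 8 the analogous one-pass tool is `students.stream().mapToInt(s -> s.sub1).summaryStatistics()`, which yields average, min, and max together):

```python
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    sub1: int
    sub2: int
    sub3: int

def summarize(students, field):
    """Reduce a list of students to (average, min, max) for one field."""
    marks = [getattr(s, field) for s in students]
    return sum(marks) / len(marks), min(marks), max(marks)

students = [Student("a", 50, 60, 70), Student("b", 70, 80, 90)]
avg, lo, hi = summarize(students, "sub1")
print(avg, lo, hi)  # 60.0 50 70
```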
3
votes
1 answer

Map/Reduce in CouchDB with multiple parameters?

I am wondering how to use CouchDB's map/reduce with multiple parameters. For example, if I have teams that have players with ages and genders, I assume I would do this for my map function: "function(doc){ if(doc.team_name) { …
user21293
  • 6,439
  • 11
  • 44
  • 57
3
votes
2 answers

How to return the object with just a few selected embedded objects?

My structure is as follows: { day: x, events: [ { year: y, info: z } ] } Up to now I created the following query, which does not return an error but does not show anything either (which is…
3
votes
1 answer

Exceptions while converting CSV to ORC

I am trying to write a MapReduce program that takes CSV input and writes it in ORC format, but I am facing a NullPointerException. Below is the exception stack trace I am getting: java.lang.Exception: java.lang.NullPointerException at…
Kunal
  • 105
  • 1
  • 7
3
votes
1 answer

Issue with ResourceManager connection while submitting a MapReduce job

I am getting a ResourceManager connection issue while submitting a MapReduce job in a Hortonworks Hadoop cluster. 15/12/03 16:58:27 INFO client.RMProxy: Connecting to ResourceManager at /:8050 15/12/03 16:58:29 INFO ipc.Client: Retrying connect to…
3
votes
0 answers

Why are combiner input records more than mapper output records?

A combiner works on the output records of the mapper. If the mapper output records are fed to the combiner, why are my combiner input records more than the mapper output records? I got 80 extra records. I have no idea where they came from & what…
shriyog
  • 938
  • 1
  • 13
  • 26
3
votes
1 answer

Spark - Group by Key then Count by Value

I have non-unique key-value pairs that I have created using the map function on an RDD of Array[String]: val kvPairs = myRdd.map(line => (line(0), line(1))) This produces data of the format: 1, A 1, A 1, B 2, C I would like to group all of the keys by…
Brian
  • 7,098
  • 15
  • 56
  • 73
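Locally, the grouping-then-counting this question describes looks like the sketch below. This is a plain-Python analogue, not Spark API; in Spark a common approach is to map each (key, value) pair to ((key, value), 1), apply reduceByKey, and then regroup by the original key.

```python
from collections import Counter, defaultdict

# The sample pairs from the question: key "1" maps to A, A, B; key "2" to C.
pairs = [("1", "A"), ("1", "A"), ("1", "B"), ("2", "C")]

# Group by key, then count occurrences of each value within that key.
counts = defaultdict(Counter)
for key, value in pairs:
    counts[key][value] += 1

print(dict(counts["1"]))  # {'A': 2, 'B': 1}
```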
3
votes
1 answer

Embeddable open-source key-value storage with liberal license

Is there any open-source document-oriented key-value map/reduce storage that: is easily embeddable (yes, it is possible to embed, say, CouchDB, but it might be a pain to take the whole Erlang machine onboard and I just don't feel good about it…
VB.
  • 474
  • 2
  • 10
3
votes
1 answer

How to sort comma-separated keys in Reducer output?

I am running an RFM Analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma-separated R (Recency), F (Frequency), M (Monetary) as the key from the Reducer, where R=BigInteger, F=BigInteger, M=BigDecimal, and the value is…
Punit Naik
  • 515
  • 7
  • 26
3
votes
3 answers

How to group MongoDB mapReduce output?

I have a query regarding the mapReduce framework in MongoDB. I have a result of key-value pairs from a mapReduce function, and now I want to run a query on this mapReduce output. I am using mapReduce to find out the stats of users like…
user29578
  • 689
  • 7
  • 21
3
votes
1 answer

MapReduce not working in CakePHP 3.x

I'm using CakePHP 3.x; my application has add/edit pages, and in the edit action I'm using this code: $patient = $this->Patients->get($patientId); to get the patient's record. Now I want to modify the value of some field after the find operation; let's say I want to…
Dr Magneto
  • 981
  • 1
  • 8
  • 18
3
votes
2 answers

Counting the number of records whose date is in a date range?

I have a collection with documents like below: {startDate: ISODate("2016-01-02T00:00:00Z"), endDate: ISODate("2016-01-05T00:00:00Z")}, {startDate: ISODate("2016-01-02T00:00:00Z"), endDate: ISODate("2016-01-08T00:00:00Z")}, {startDate:…
Abe Miessler
  • 82,532
  • 99
  • 305
  • 486
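One reading of this question — counting how many documents' [startDate, endDate] range covers a given day — can be sketched locally. This interpretation is an assumption, since the excerpt is truncated, and `count_covering` is a hypothetical helper name; in MongoDB the same check would typically be a count with `$lte`/`$gte` conditions rather than map/reduce.

```python
from datetime import date

# Sample documents modelled on the question's collection.
docs = [
    {"startDate": date(2016, 1, 2), "endDate": date(2016, 1, 5)},
    {"startDate": date(2016, 1, 2), "endDate": date(2016, 1, 8)},
]

def count_covering(docs, day):
    """Count documents whose [startDate, endDate] range contains `day`."""
    return sum(1 for d in docs if d["startDate"] <= day <= d["endDate"])

print(count_covering(docs, date(2016, 1, 6)))  # 1
```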
3
votes
2 answers

Spark: Efficient way to get top K frequent values per key in (key, value) RDD?

I have an RDD of (key, value) pairs. I need to fetch the top k values according to their frequencies for each key. I understand that the best way to do this would be using combineByKey. Currently, here is what my combineByKey combinators look like: object…
sushant-hiray
  • 1,838
  • 2
  • 21
  • 28
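The per-key top-k-by-frequency logic can be sketched in plain Python as below. This is a local analogue only; a Spark combineByKey solution would build a bounded frequency map (or heap) per key inside each partition and merge those maps across partitions.

```python
from collections import Counter, defaultdict

def top_k_per_key(pairs, k):
    """Return the k most frequent values for each key."""
    freq = defaultdict(Counter)
    for key, value in pairs:
        freq[key][value] += 1
    # most_common(k) sorts each key's values by descending frequency.
    return {key: [v for v, _ in c.most_common(k)] for key, c in freq.items()}

pairs = [("a", 1), ("a", 1), ("a", 2), ("a", 3), ("a", 3), ("a", 3), ("b", 9)]
print(top_k_per_key(pairs, 2)["a"])  # [3, 1]
```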