Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase in parallel; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
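In miniature, the two steps above can be sketched in plain Python. This is a hedged, single-process illustration using the classic word-count example; the function names (`map_phase`, `reduce_phase`, `mapreduce`) are illustrative, not part of any framework:

```python
from collections import defaultdict

def map_phase(document):
    """'Map' step: emit an intermediate (key, value) pair per word."""
    for word in document.split():
        yield (word, 1)

def reduce_phase(key, values):
    """'Reduce' step: merge all intermediate values sharing the same key."""
    return (key, sum(values))

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key so each reduce call
    # sees every value for its key at the same time.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

counts = mapreduce(["to be or not to be", "to do"])
# counts == {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

A real framework distributes the map and reduce calls across nodes and performs the grouping ("shuffle") over the network; the sketch only shows the data flow.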

12151 questions
3 votes, 2 answers

Aggregate functions with MongoDB/Mongoid and calculated fields

I'm migrating an existing Rails app to use MongoDB (with Mongoid), and I'm having some trouble figuring out how to do aggregations like you can do with MySQL. Previously I had something like SELECT DATE(created_at) AS day, SUM(amount) AS amount…
Avishai (4,512 rep)
3 votes, 1 answer

I use the function map() in Python and convert the result into a list, but the result is an empty list

Why am I getting a wrong list when I use the function map() in Python? Here is my code. When I use the print() function to print list(r) with print(list(r)), my result is an empty list. But when I write rList = list(r) and then print(rList), my…
noobshane (43 rep)
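A likely cause of the empty list above (hedged, since the question's code is truncated): in Python 3, map() returns a one-shot iterator, so a second conversion to a list finds it already exhausted:

```python
r = map(lambda x: x * 2, [1, 2, 3])

first = list(r)   # consumes the iterator
second = list(r)  # nothing left: the iterator is exhausted
# first == [2, 4, 6], second == []

# Fix: materialize once and reuse the resulting list.
rList = list(map(lambda x: x * 2, [1, 2, 3]))
```

This explains why print(list(r)) shows [] after list(r) was already called once, while converting once into rList and printing that works.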
3 votes, 0 answers

How to find average using map reduce in MongoDB?

My document is of format: { "PItems": { "Workspaces": [ { "Key": "Item1", "Size": 228.399, "Foo": "bar" }, { "Key": "Item2", "Size": 111.399, "Bar": "baz" }, { …
Nemo (24,540 rep)
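For the averaging question above, modern MongoDB would normally use the aggregation pipeline ($unwind the array, then $group with $avg) rather than mapReduce. Keeping all examples here in Python, this sketch performs the same computation in memory over the document shape from the excerpt (field names are taken from the question; the truncated fields are omitted):

```python
# A document shaped like the question's excerpt.
doc = {
    "PItems": {
        "Workspaces": [
            {"Key": "Item1", "Size": 228.399, "Foo": "bar"},
            {"Key": "Item2", "Size": 111.399, "Bar": "baz"},
        ]
    }
}

# Equivalent of unwinding the "Workspaces" array and averaging "Size".
sizes = [w["Size"] for w in doc["PItems"]["Workspaces"]]
average = sum(sizes) / len(sizes)
# average ≈ 169.899
```

The server-side equivalent would be roughly the pipeline [{"$unwind": "$PItems.Workspaces"}, {"$group": {"_id": None, "avg": {"$avg": "$PItems.Workspaces.Size"}}}]; treat the exact field paths as an assumption about the full schema, which is truncated in the excerpt.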
3 votes, 3 answers

Can we use Hadoop MapReduce for real-time data processing?

We usually use Hadoop MapReduce and its ecosystem (like Hive) for batch processing. But I would like to know whether there is any way we can use Hadoop MapReduce for real-time data processing, for example live results or live tweets. If not, what…
Ketan (89 rep)
3 votes, 1 answer

MRJob sort reducer output

Is there any way to sort the output of the reducer function using mrjob? I think the input to the reducer function is sorted by key, and I tried to exploit this feature to sort the output using another reducer, like below, where I know the values have…
Dandelion (744 rep)
3 votes, 4 answers

Clusters available for using Hadoop/MapReduce framework

Does anyone know any freely accessible clusters that are open to the public and that use a Hadoop/MapReduce framework? There are plenty of tutorials on how to use MapReduce, but is there a way to test the examples without using my local single machine and…
Michael Eilers Smith (8,466 rep)
3 votes, 0 answers

org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1537255103946_84288_m_000000_0

I am running a Spark job in Oozie using a shell action, but the Spark task is not launching. The job is stuck and only launching the below task: 2018-10-23 09:20:31,021 INFO [IPC Server handler 3 on 32995] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress…
3 votes, 1 answer

What is a job history server in Hadoop and why is it mandatory to start the history server before starting Pig in Map Reduce mode?

Before starting Pig in MapReduce mode you always have to start the history server; otherwise, while trying to execute Pig Latin statements, the below-mentioned logs are generated: 2018-10-18 15:59:13,709 [main] INFO …
Sarvagya Dubey (435 rep)
3 votes, 3 answers

Faster min and max of different array components with CouchDB map/reduce?

I have a CouchDB database with a view whose values are paired numbers of the form [x,y]. For documents with the same key, I need (simultaneously) to compute the minimum of x and the maximum of y. The database I am working with contains about 50000…
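For the CouchDB question above, the reduce has to be written so that it also handles the rereduce case. CouchDB itself takes a JavaScript reduce function; keeping all examples here in Python, this hedged sketch only shows that min-of-x / max-of-y combines associatively, which is what makes a single reduce valid for both phases:

```python
def combine(pairs):
    # Each value is an [x, y] pair; keep the minimum x and maximum y.
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return [min(xs), max(ys)]

values = [[3, 7], [1, 9], [5, 2]]

# Reducing everything at once...
whole = combine(values)
# ...matches reducing partial results (the rereduce case):
partial = combine([combine(values[:2]), combine(values[2:])])
assert whole == partial == [1, 9]
```

Because the operation is associative, CouchDB can compute both values in one view pass instead of two separate reduces.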
3 votes, 1 answer

CouchDB - filter latest log per logged instance from a list

I could use some help filtering distinct values from a CouchDB view. I have a database that stores logs with information about computers. Periodically, new logs for a computer are written to the db. A bit simplified, I store entries like these: { …
arie (18,737 rep)
3 votes, 4 answers

Java: reduce a collection of strings to a map of occurrences

Consider a list such as id1_f, id2_d, id3_f, id1_g. How can I use a stream to get a reduced map of statistics like: id1 2, id2 1, id3 1? Note: the key is the part before _. Can the reduce function help here?
chrisTina (2,298 rep)
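The question above asks about Java streams, where Collectors.groupingBy combined with Collectors.counting() applies. To keep all examples here in one language, the same reduction sketched in Python:

```python
from collections import Counter

ids = ["id1_f", "id2_d", "id3_f", "id1_g"]

# Group by the part before "_" and count occurrences.
occurrences = Counter(s.split("_", 1)[0] for s in ids)
# occurrences == {'id1': 2, 'id2': 1, 'id3': 1}
```

The shape is the same in both languages: map each string to its prefix key, then count per key; no explicit reduce() call is needed.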
3 votes, 1 answer

Where do views (map/reduce) impact Cloudant NoSQL limits?

IBM Cloudant NoSQL has limits on lookups, writes, and queries per second. On Cloudant I can write a design document view. When I read a view, which limit does the read count against: lookups/sec or queries/sec? For example, this is the view: function (doc) { …
Giovesoft (580 rep)
3 votes, 2 answers

Parallelize Python's reduce command

In Python I'm running a command of the form reduce(func, bigArray[1:], bigArray[0]) and I'd like to add parallel processing to speed it up. I am aware I can do this manually by splitting the array, running processes on the separate portions, and…
ajspencer (1,017 rep)
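The manual splitting the question describes can be sketched with the standard library alone. Hedged: func must be associative for the chunked result to equal the sequential one, and the parallelism only pays off when func is expensive relative to inter-process pickling; `chunked_reduce` is an illustrative name, not a library function:

```python
from functools import reduce
from multiprocessing import Pool
import operator

def chunked_reduce(func, seq, chunks=4):
    """Reduce each chunk in a worker process, then reduce the partial results."""
    size = max(1, len(seq) // chunks)
    parts = [seq[i:i + size] for i in range(0, len(seq), size)]
    with Pool(len(parts)) as pool:
        # Each worker runs reduce(func, part) on its own chunk.
        partials = pool.starmap(reduce, [(func, part) for part in parts])
    return reduce(func, partials)

if __name__ == "__main__":
    bigArray = list(range(1, 101))
    print(chunked_reduce(operator.add, bigArray))  # 5050, same as reduce(operator.add, bigArray)
```

With an associative func like operator.add this matches the sequential reduce; for non-associative functions the chunked result can differ, which is the main caveat to this approach.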
3 votes, 3 answers

Container is running beyond physical memory limits

I have a MapReduce job that processes 1.4 TB of data. While running it, I get the error below. The number of splits is 6444. Before starting the job I set the following settings: conf.set("mapreduce.map.memory.mb",…
Eeelijah (121 rep)
3 votes, 1 answer

Hadoop, hardware and bioinformatics

We're about to buy new hardware to run our analyses and are wondering if we're making the right decisions. The setting: We're a bioinformatics lab that will be handling DNA sequencing data. The biggest issue that our field has is the amount of data,…
jandot (4,164 rep)