Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation that share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to datasets far larger than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
3
votes
3 answers

How to find the global average in a large dataset?

I am writing simple MapReduce programs to find the average, smallest number, and largest number present in my data (many text files). I guess using a combiner to find the desired stuff within the numbers processed by a single mapper first would make…
Amy2477
  • 33
  • 1
  • 5
3
votes
1 answer

RT parallel processing in Rails

I'm developing a sort of personalized search engine in Ruby on Rails, and I'm currently trying to find best way of sorting results depending on user's record, in real time. Example: items that are searched for can have tags (separate entities with…
Otigo
  • 575
  • 1
  • 5
  • 5
3
votes
2 answers

MapReduce in PyMongo

My Mongo collection Impressions has docs in the following format: { _uid: 10, "impressions": [ { "pos": 6, "id": 123, "service": "furniture" }, …
nimeshkiranverma
  • 1,408
  • 6
  • 25
  • 48
3
votes
1 answer

Sorting in MapReduce Hadoop

I have a few basic questions about Hadoop MapReduce. Assume 100 mappers were executed and zero reducers. Will it generate 100 files? Is each individual file sorted? Is the output sorted across all mappers? Input for a reducer is Key -> Values. For each key, all…
Nageswaran
  • 7,481
  • 14
  • 55
  • 74
3
votes
1 answer

Mapreduce error: Failed to setup local dir

I'm running the MapReduce wordcount example on Hadoop installed on Windows 8. I got the error below. It sounds like a security permission issue, but I'm not very sure. I added a property to the yarn-site.xml file as
lijie98
  • 617
  • 1
  • 6
  • 13
3
votes
3 answers

How to combine multiple Hadoop MapReduce Jobs into one?

I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input. My goal: Compute these different tasks as fast as…
stefanw
  • 10,456
  • 3
  • 36
  • 34
3
votes
2 answers

Renaming part files of PIG output

I have a requirement to change the part file naming convention after running my PIG job. I want part-r-0000 to be userdefinedName-r-0000. Any possible solution to that? I am avoiding the hadoop -cp and hadoop -mv commands. Thanks
Aviral Kumar
  • 814
  • 1
  • 15
  • 40
3
votes
0 answers

bigdata: how to analyze pst/email data?

I have PST or email files in HDFS. Now I want to do text analysis with whichever Hadoop component suits best. How do I start? Do I have to first extract the actual content out of these files and store it somewhere (in a…
natarajan k
  • 406
  • 9
  • 24
3
votes
3 answers

PHP vs. Other Languages in Hadoop/MapReduce implementations, and in the Cloud generally

I'm beginning to learn some Hadoop/MapReduce, coming mostly from a PHP background, with a little bit of Java and Python. But, it seems like most implementations of MapReduce out there are in Java, Ruby, C++ or Python. I've looked, and it looks…
Yahel
  • 37,023
  • 22
  • 103
  • 153
3
votes
1 answer

How to Sort Reducer Output?

I want to sort the output of my reducer. A sample of my reducer output is shown below: 0,0 2.5 0,1 3.0 1,0 4.0 1,1 1.5 The reducer output is obviously sorted by first element of the key. But I wanted to sort it by the second element of…
Punit Naik
  • 515
  • 7
  • 26
3
votes
0 answers

How to extract contents of bz2 files - Hadoop

I have a tar archive (about 40 GB) which has many subfolders within which my data resides. The structure is : Folders -> Sub Folders -> json.bz2 files. TAR file: Total size: ~ 40GB Number of inner .bz2 files (arranged in folders): 50,000 Size of one…
3
votes
1 answer

Why are we configuring mapred.job.tracker in YARN?

What I know is that YARN was introduced and replaced the JobTracker and TaskTracker. I have seen some Hadoop 2.6.0/2.7.0 installation tutorials, and they are configuring mapreduce.framework.name as yarn and the mapred.job.tracker property as local or…
user4498972
3
votes
2 answers

When does an action not run on the driver in Apache Spark?

I have just started with Spark and was struggling with the concept of tasks. Can anyone please help me understand when an action (say reduce) does not run in the driver program? From the Spark tutorial, "Aggregate the elements of the dataset…
ankit409
  • 93
  • 7
3
votes
1 answer

How do we improve a MongoDB MapReduce function that takes too long to retrieve data and gives out of memory errors?

Retrieving data from MongoDB takes too long, even for small datasets. For bigger datasets we get out-of-memory errors from the JavaScript engine. We've tried several schema designs and several ways to retrieve data. How do we optimize mongoDB/mapReduce…
EvaH
  • 33
  • 6
3
votes
1 answer

How to efficiently find top-k elements?

I have a big sequence file storing the tfidf values for documents. Each line represents a document and the columns are the tfidf values for each term (the row is a sparse vector). I'd like to pick the top-k words for each document using Hadoop. The…
HHH
  • 6,085
  • 20
  • 92
  • 164