
Given that the complexity of the map and reduce tasks is O(map) = f(n) and O(reduce) = g(n), has anybody taken the time to write down how the intrinsic Map/Reduce operations (sorting, shuffling, sending data, etc.) increase the computational complexity? What is the overhead of the Map/Reduce orchestration?

I know that this is a non-issue when your problem is big enough: you just don't care about the inefficiencies. But for small problems that can run on a small machine or a couple of machines, should I go through the pain of designing parallel algorithms when I already have a Map/Reduce implementation at hand?

– monksy, tonicebrian
    It's the other way around. Complexity calculations like O() come into effect more when a problem is large. At small data sizes, other factors like communication overhead often dominate the time taken by a function. – tkerwin Jul 30 '10 at 13:13
  • Actually it's the other way around. Network bandwidth is almost always the most constrained resource in a cluster. In almost all jobs the actual computation is very little of the runtime compared to IO. – Steve Severance Jul 30 '10 at 15:33

3 Answers


For small problems "that can run on a small machine or a couple of machines", yes, you should rewrite them if performance is essential. As others have pointed out, communication overhead is high.

I don't think anybody has done any complexity analysis on M/R operations because it's so heavily implementation-, machine-, and algorithm-specific. You'd end up with many variables just for, say, sorting:

O(n log n * s * (1/p)) where:
 - n is the number of items
 - s is the number of nodes
 - p is the ping time between nodes (assuming equal ping times between all nodes in the network)

Does that make sense? It gets really messy really quick. M/R is also a programming framework, not an algorithm in and of itself, and complexity analysis is usually reserved for algorithms.
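
For a feel of how quickly the terms pile up, here is a purely hypothetical back-of-the-envelope sketch in Python; the cost breakdown (local sorts plus a shuffle term) and every constant in it are assumptions made up for illustration, not measurements of any real framework:

    import math

    def distributed_sort_cost(n, nodes, ping_s,
                              per_item_cpu_s=1e-7, per_item_net_s=1e-8):
        """Toy cost model for a distributed sort (illustrative only).

        Assumes the n items are split evenly across the nodes, each node
        sorts its share locally, and roughly every item crosses the
        network once during the shuffle, paying a per-item transfer cost
        plus a per-node round-trip latency.
        """
        per_node = n / nodes
        local_sort = per_node * math.log2(max(per_node, 2)) * per_item_cpu_s
        shuffle = n * per_item_net_s + nodes * ping_s
        return local_sort + shuffle

    # Plug in numbers for a "small" job and compare 1 node vs 10 nodes;
    # note that this model leaves out the fixed per-job setup cost, which
    # is often what actually dominates for small inputs.
    for nodes in (1, 10):
        print(nodes, "node(s):", distributed_sort_cost(100_000, nodes, ping_s=0.001))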

The closest thing to what you're looking for may be complexity analysis of multi-threaded algorithms, which is much simpler.
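
For reference, that kind of analysis is usually phrased in the work-span model (standard textbook material, not something specific to the answers here): with T_1 the total work, T_inf the span (critical-path length), and T_P the running time on P processors under a greedy scheduler,

    % Work-span model for multi-threaded algorithms (greedy scheduler bound):
    %   T_1      = total work (running time on one processor)
    %   T_\infty = span, i.e. the length of the critical path
    %   T_P      = running time on P processors
    \[
      \max\!\left(\frac{T_1}{P},\, T_\infty\right) \;\le\; T_P \;\le\; \frac{T_1}{P} + T_\infty,
      \qquad\text{so}\qquad
      T_P = \Theta\!\left(\frac{T_1}{P} + T_\infty\right).
    \]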

– The Alchemist

I know that this is a non-issue when your problem is big enough: you just don't care about the inefficiencies. But for small problems that can run on a small machine or a couple of machines, should I go through the pain of designing parallel algorithms when I already have a Map/Reduce implementation at hand?

This is a difficult problem to analyze. On the one hand, if the problem is too small then classical complexity analysis is liable to give the wrong answer, because the lower-order terms dominate for small N.

On the other hand, complexity analysis where one of the variables is the number of compute nodes will also fail if the number of compute nodes is too small ... once again because the overheads of the Map/Reduce infrastructure contribute to the lower-order terms.

So what can you do about it? Well, one approach would be to do a more detailed analysis that does not rely on asymptotic complexity alone. Figure out the cost function(s), including the lower-order terms and the constants, for your particular implementation of the algorithms and the map/reduce framework. Then substitute values for the problem-size variables, the number of nodes, and so on. It's complicated, though you may be able to get by with estimates for certain parts of the cost function.
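
As a toy illustration of that kind of substitution (the shape of the cost functions and every constant below are made-up assumptions, not measurements of any real implementation), you could write the two cost functions down and scan for the break-even point:

    # Hypothetical cost functions for the same job run locally vs. on a
    # small cluster. Every constant is an assumption for illustration only.
    FRAMEWORK_OVERHEAD_S = 30.0    # assumed fixed job setup/scheduling cost
    PER_ITEM_LOCAL_S     = 1e-5    # assumed per-item cost on one machine
    PER_ITEM_CLUSTER_S   = 1.5e-5  # assumed per-item cost incl. (de)serialisation
    NODES                = 4

    def local_cost(n):
        return n * PER_ITEM_LOCAL_S

    def mapreduce_cost(n):
        return FRAMEWORK_OVERHEAD_S + n * PER_ITEM_CLUSTER_S / NODES

    # Scan problem sizes to see where the cluster starts paying off.
    for n in (10**4, 10**5, 10**6, 10**7, 10**8):
        print(f"n={n:>9}: local={local_cost(n):9.1f}s  cluster={mapreduce_cost(n):9.1f}s")

With those (made-up) numbers the cluster only wins once n is in the millions, which is exactly the kind of conclusion that asymptotic analysis alone won't give you.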

The second approach is to "suck it and see": implement it both ways and measure.
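
A minimal sketch of what that means in practice; run_locally and run_on_cluster are hypothetical stand-ins for whichever two implementations you are comparing:

    import time

    def measure(label, fn, *args):
        """Crude benchmark: run the function once and report wall-clock time."""
        start = time.perf_counter()
        result = fn(*args)
        print(f"{label}: {time.perf_counter() - start:.3f}s")
        return result

    # Hypothetical usage -- substitute your own implementations and data:
    # measure("single machine", run_locally, data)
    # measure("map/reduce job", run_on_cluster, data)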

– Stephen C

Map-Reduce for Machine Learning on Multicore is worth a look; it compares how the complexity of various well-known machine learning algorithms changes when they are recast in an MR-"friendly" form.
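
For a flavour of the idea (only a local toy simulation of the paper's "summation form" trick, not real MapReduce code): if the statistics an algorithm needs are sums over the data, each mapper can compute a partial sum over its chunk and the reducer just adds the partials.

    from functools import reduce

    def mapper(chunk):
        # Partial sufficient statistics for a mean: (sum, count).
        return (sum(chunk), len(chunk))

    def reducer(a, b):
        # Combining partials is just element-wise addition.
        return (a[0] + b[0], a[1] + b[1])

    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    total, count = reduce(reducer, map(mapper, chunks))
    print("mean =", total / count)   # 3.5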

Cheers.

– pagid