57

Does the aggregation framework, introduced in MongoDB 2.2, have any special performance improvements over map/reduce?

If yes, why, how, and by how much?

(I have already run a test myself, and the performance was nearly the same.)

didxga
Taha Jahangir
  • 1
    "nearly" the same? With which benchmarks? Your remark is basically pointless. And you are comparing cat and cows. In addition you know yourself that the MR is still limit to single-threading....so: pointless question and therefore -1 –  Dec 17 '12 at 06:01
  • @user1833746 It's a question, I don't want to explain my benchmarks. I asked it to get new answers. Please vote up to allow others to answer. – Taha Jahangir Dec 17 '12 at 06:59
  • have you seen this question (and answers)? http://stackoverflow.com/questions/12139149/mapreduce-with-mongodb-really-really-slow-30-hours-vs-20-minutes-in-mysql-for – Asya Kamsky Dec 17 '12 at 08:57
  • @Asya Yes, see my benchmark below – Taha Jahangir Dec 17 '12 at 09:24
  • 1
    Please refer to this link for more understanding: https://runnable.com/blog/pipelines-vs-map-reduce-to-speed-up-data-aggregation-in-mongodb – soheshdoshi Feb 12 '21 at 10:22

2 Answers

66

Every test I have personally run (including using your own data) shows the aggregation framework being multiple times faster than map/reduce, and usually an order of magnitude faster.

Just taking 1/10th of the data you posted (but warming the cache first rather than clearing the OS cache, because I want to measure the performance of the aggregation, not how long it takes to page in the data) I got this:

MapReduce: 1,058ms
Aggregation Framework: 133ms

Removing the $match from the aggregation framework pipeline and the {query:} from mapReduce (because both would just use an index, and that's not what we want to measure) and grouping the entire dataset by key2, I got:

MapReduce: 18,803ms
Aggregation Framework: 1,535ms

Those are very much in line with my previous experiments.
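
For reference, the stripped-down commands would look roughly like this (a sketch; it assumes the testtable collection and field names from the benchmark answer below):

db.testtable.mapReduce(
    function() { emit(this.key2, this.value); },
    function(key, values) { var i = 0; values.forEach(function(v) { i += v; }); return i; },
    { out: { inline: 1 } }  // no query: clause, so the whole collection is grouped
)

db.testtable.aggregate(
    { $group: { _id: '$key2', pop: { $sum: '$value' } } }  // no $match stage
)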

Asya Kamsky
  • for additional comments on this see answer to http://stackoverflow.com/questions/12139149/mapreduce-with-mongodb-really-really-slow-30-hours-vs-20-minutes-in-mysql-for – Asya Kamsky Dec 18 '12 at 16:43
  • Thanks for answering the first portion of the question! What about the second part? Why and how? Do you have something to add for that? Thank you for any input. – Jeach Aug 03 '14 at 14:26
  • 1
    this is covered in the docs - but in a nutshell, aggregation runs natively in the server (C++), MapReduce spawns separate javascript thread(s) to run JS code. – Asya Kamsky Aug 04 '14 at 17:39
9

My benchmark:

== Data Generation ==

Generated 4 million rows (with the Python script below); each document is approximately 350 bytes and has these keys:

  • key1, key2: two random fields to test indexing (key1 has a cardinality of 20, key2 a cardinality of 2000)
  • baddata: a long string to increase the size of each document
  • value: a simple number (a constant 10) to test aggregation

import os
import random
from binascii import hexlify

from pymongo import Connection  # pre-MongoClient pymongo API

db = Connection('127.0.0.1').test # mongo connection
random.seed(1)
for _ in range(2):
    key1s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(10)]    # 2 x 10 = 20 distinct key1 values
    key2s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(1000)]  # 2 x 1000 = 2000 distinct key2 values
    baddata = 'some long data ' + '*' * 300  # padding to bring each document to ~350 bytes
    for i in range(2000):
        data_list = [{
                'key1': random.choice(key1s),
                'key2': random.choice(key2s),
                'baddata': baddata,
                'value': 10,
                } for _ in range(1000)]
        for data in data_list:
            db.testtable.save(data)  # 2 * 2000 * 1000 = 4 million documents total
Total data size was about 6 GB in Mongo (and 2 GB in Postgres).
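
The queries below filter on key1, and the data was generated "to test indexing", so an index on key1 is presumably in place; a sketch of creating it with the 2.2-era shell helper (assumed, as the original doesn't show this step):

db.testtable.ensureIndex({ key1: 1 })  // assumed: index backing the key1 match/query below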

== Tests ==

I ran several tests, but one is enough for comparing results:

NOTE: The server was restarted and the OS cache was cleared after each query, to eliminate the effect of caching.

QUERY: aggregate all rows with key1=somevalue (about 200K rows) and sum value for each key2

  • map/reduce 10.6 sec
  • aggregate 9.7 sec
  • group 10.3 sec
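
The answer doesn't show how these times were captured; a minimal way in the mongo shell would be something like this (a sketch, wall-clock only, using the aggregate query below):

var start = new Date();
db.testtable.aggregate({ $match: {key1: '663969462d2ec0a5fc34'}}, {$group: {_id: '$key2', pop: {$sum: '$value'}} });  // query under test
print((new Date() - start) + ' ms');  // elapsed wall-clock time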

queries:

map/reduce:

db.testtable.mapReduce(
    function() { emit(this.key2, this.value); },
    function(key, values) {
        var i = 0;
        values.forEach(function(v) { i += v; });
        return i;
    },
    { out: { inline: 1 }, query: { key1: '663969462d2ec0a5fc34' } }
)

aggregate:

db.testtable.aggregate(
    { $match: { key1: '663969462d2ec0a5fc34' } },
    { $group: { _id: '$key2', pop: { $sum: '$value' } } }
)

group:

db.testtable.group({
    key: { key2: 1 },
    cond: { key1: '663969462d2ec0a5fc34' },
    reduce: function(obj, prev) { prev.csum += obj.value; },
    initial: { csum: 0 }
})
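
As Asya Kamsky's first comment below points out, group is not part of the aggregation framework. For completeness, here is the aggregate call written with the pipeline-as-array form, db.collection.aggregate([ pipeline ]), equivalent to the aggregate query above:

db.testtable.aggregate([
    { $match: { key1: '663969462d2ec0a5fc34' } },
    { $group: { _id: '$key2', pop: { $sum: '$value' } } }
])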

Taha Jahangir
  • 4
    group is not the aggregation framework, it's part of map/reduce. That's why it has a reduce function. See the difference here: http://docs.mongodb.org/manual/reference/command/group/ and http://docs.mongodb.org/manual/reference/aggregation/#_S_group If you were using the aggregation framework you would call db.collection.aggregate( [ pipeline ] ) – Asya Kamsky Dec 17 '12 at 09:38
  • I have a suggestion: why don't you take out the query and run the same thing on your entire collection and see if there is a difference in performance. – Asya Kamsky Dec 17 '12 at 09:55
  • 4
    another problem with your benchmark is that you cleared the OS cache, so you were mostly measuring the time it takes to page the data into RAM. That dwarfs the actual performance numbers, and it's not a realistic scenario. – Asya Kamsky Dec 17 '12 at 10:08