I am trying to improve the speed of my aggregation query. My idea is to create a new collection with Map Reduce and to run aggregation on it. Will this decrease processing time? What are the drawbacks?

Christian Strempfer
Tudor
    If it does then who is to say otherwise? – Sammaye Nov 14 '14 at 14:29
  • Do you plan to run the MapReduce every time you want to run the aggregation, or do you only intend to run MapReduce from time to time to update the collection on which you run the aggregation more frequently? – Philipp Nov 14 '14 at 14:53
  • MR will be from time to time, because it currently takes about 173 sec. I want to improve the aggregation from 0.7 sec to at least 0.3 sec. Therefore I create a dependency of the aggregation on MR. This is what worries me. Will there be a future issue because of this? – Tudor Nov 14 '14 at 15:02
  • I rephrased it to make it less opinion-based. If this isn't what you want, feel free to edit it again. – Christian Strempfer Nov 15 '14 at 13:39

1 Answer

Yes, it could work; it's a common pattern for improving query speed. But MongoDB is a special case here: Map Reduce requires JavaScript evaluation, while the aggregation framework is implemented natively, so aggregation is faster.

I advise separating the concept from the technology. It's still a good idea to do pre-calculations in a batch job, but you should do it with the aggregation framework instead of Map Reduce.
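As a sketch of what that batch job could look like in the mongo shell (the collection and field names here are made up for illustration, not taken from the question), an aggregation pipeline ending in `$out` writes its results into a separate collection, which the faster real-time queries can then read:

```javascript
// Hypothetical batch job: pre-aggregate daily counts from an "events" collection.
// $out (available since MongoDB 2.6) replaces the target collection
// "events.daily" with the pipeline's output on success.
db.events.aggregate([
    { $group: {
        _id: { day: { $dayOfYear: "$timestamp" }, type: "$type" },
        count: { $sum: 1 }
    } },
    { $out: "events.daily" }
])

// Real-time queries then read the small pre-aggregated collection
// instead of scanning the raw data:
db.getCollection("events.daily").find({ "_id.type": "click" })
```

Run the first command on a schedule (e.g. via a cron job) and point your latency-sensitive queries at the output collection.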

Drawbacks of using batch jobs are:

  • You can only query what ran through your batch job, so there will be a delay between inserting new data and being able to retrieve it.
  • It's only faster if you're able to reduce the complexity of your real-time query. For example, your new query should read fewer documents than the original one.
  • Your application consumes more disk space, because you're creating an additional collection.
  • It adds complexity. Therefore start with real-time aggregation, and only do pre-calculations once it becomes a performance problem.

For further latency improvements you might consider implementing a Lambda architecture, which reduces query time to a minimum by pre-calculating results as far as possible.

This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate precomputed views, while simultaneously using real-time stream processing to provide dynamic views. The two view outputs may be joined before presentation.
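Applied to this case, a minimal Lambda-style sketch (names are illustrative, continuing the hypothetical "events" collection from above, not from the question) would merge the pre-computed batch view with a real-time aggregation over only the documents inserted since the last batch run:

```javascript
// Assume the scheduled batch job last ran at lastBatchTime and wrote
// its output to the "events.daily" collection.
var lastBatchTime = ISODate("2014-11-14T00:00:00Z");  // illustrative value

// Speed layer: aggregate only the documents newer than the batch view.
var recent = db.events.aggregate([
    { $match: { timestamp: { $gt: lastBatchTime } } },
    { $group: { _id: "$type", count: { $sum: 1 } } }
]).toArray();

// Batch layer: read the pre-computed totals.
var batch = db.getCollection("events.daily").find().toArray();

// Merge the two views in the application before presenting the result.
```

The speed layer stays cheap because it only scans the small window of data the batch job hasn't processed yet; the join of the two views happens application-side.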

Christian Strempfer
  • Chris, many thanks for the answer and the load of information regarding my concern. I will move on with learning more about the Lambda architecture. – Tudor Nov 17 '14 at 13:03