
I had always thought that Mongo had excellent performance with its map/reduce functionality, but I am now reading that it is a slow implementation. So if I had to pick an alternative to benchmark against, what should it be?

My software will be such that users will often have millions of records and will often be sorting and crunching through unpredictable subsets of tens or hundreds of thousands of them. Most of the analysis that uses the full millions of records can be done in summary tables and the like. I'd originally thought Hypertable was a viable alternative, but while researching I saw that its own documentation mentions Mongo as the faster option, with Hypertable offering other benefits. For my application, though, speed is my number one initial priority.

Jeremy Smith
    Any particular reason that you want to use a NoSQL database for this? Millions of records is nothing for a decent RDBMS to work with even on modest hardware. – mu is too short Jun 01 '11 at 17:25
  • My competitors are using postgres, and my assumption was always that this new generation of NoSQL databases would blow MySQL or Postgres out of the water. Their current software can easily take 60+ seconds to do any kind of deep analysis... Though they force installation of the db on a user's machine, whereas I am taking it to the cloud and so will be able to distribute the load. – Jeremy Smith Jun 01 '11 at 17:32
  • I would suggest using Hadoop on top of Amazon's Elastic MapReduce, not because I actually know, but because some quite successful companies are using it for number crunching. Here's an example of how [Foursquare](http://www.webmonkey.com/2011/03/who-swears-the-most-how-foursquare-used-hadoop-to-find-out/) uses it. – Augusto Jun 01 '11 at 20:03

1 Answer


First of all, it's important to decide what "fast enough" means. There are undoubtedly faster solutions than MongoDB's map/reduce, but in most cases you may be looking at significantly higher development costs.

That said, MongoDB's map/reduce runs, at the time of writing, on a single thread, which means it will not utilize all of the CPU available to it. Also, MongoDB currently has very little in the way of native aggregation functionality. This will change from version 2.1 onwards with the new aggregation framework, which should improve performance (see https://jira.mongodb.org/browse/SERVER-447 and http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011).
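To make the difference concrete, here is a minimal sketch (not from the answer itself) contrasting the JavaScript map/reduce path with an equivalent aggregation pipeline, using PyMongo. The collection name `events` and the fields `user_id`/`amount` are made-up examples, and the `map_reduce` helper shown was removed in PyMongo 4.x, so treat this as illustrative only:

```python
# Sketch: summing amounts per user two ways in MongoDB.
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Classic map/reduce: map and reduce are JavaScript functions executed
# server-side on a single thread per mongod.
mapper = Code("function () { emit(this.user_id, this.amount); }")
reducer = Code("function (key, values) { return Array.sum(values); }")
# Collection.map_reduce() existed in older PyMongo versions (pre-4.0);
# newer drivers would issue the mapReduce database command directly.
mr_result = db.events.map_reduce(mapper, reducer, "totals_by_user")

# Aggregation framework (version 2.1+): declarative stages that run in
# native code, generally much faster than the JavaScript map/reduce path.
pipeline = [
    {"$group": {"_id": "$user_id", "total": {"$sum": "$amount"}}},
]
agg_result = list(db.events.aggregate(pipeline))
```

The pipeline form is usually what you want for the kind of summary-table style aggregation the question describes, falling back to map/reduce only for logic that the pipeline stages cannot express.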

Now, what MongoDB is good at is scaling out easily, especially when it comes to reads. And this is important, because the best solution for number crunching on large datasets is definitely a map/reduce cloud like Augusto suggested. Let such an m/r cluster do the number crunching while MongoDB serves the required data at high speed. If database query throughput is too low, that's easily solved by adding more mongo shards; if number crunching/aggregation is too slow, that's solved by adding more m/r boxes. Basically, performance becomes a function of the number of instances you reserve for the problem, and thus of cost.
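As a rough illustration of the "add more mongo shards" part, the sketch below enables sharding for a database and collection via the admin commands, again through PyMongo. The database name `demo`, collection `events`, and shard key `user_id` are assumptions for illustration; a real deployment also needs config servers, mongos routers, and an index on the shard key:

```python
# Sketch: enabling sharding for a collection (run against a mongos router
# in an existing sharded cluster, not a standalone mongod).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Mark the database as sharded, then shard the collection on user_id so
# documents are distributed across shards by that key.
client.admin.command("enableSharding", "demo")
client.admin.command("shardCollection", "demo.events", key={"user_id": 1})
```

Picking a shard key that matches how the subsets are selected (per-user here) is what lets reads fan out across shards instead of hitting a single box.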

Remon van Vliet
  • Thanks Remon, I was at the NY Mongo conference and the aggregation features were basically the thing that made me decide to stick with Mongo. I'm very very happy I found out about it before I switched to Hbase or something similar. – Jeremy Smith Jul 02 '11 at 15:16