41

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.

The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?

Thank you

Mark Bao
  • 896
  • 1
  • 10
  • 19
  • Is MongoDB too slow? Did you test the performance? I see no reason to switch if performance is fine. – TTT Jun 10 '10 at 08:36
  • It's not too slow, but if the situation is that I have to serve a page within 10 milliseconds or risk backing up the server, it's better if something's faster, even by 10%. – Mark Bao Jun 10 '10 at 08:51
  • Don't know about Redis, but if you're concerned about read performance, I'd go with MongoDB over Cassandra. See my answer below for more details. – Data Monk Mar 26 '11 at 09:01
  • A bit late, but for those who end up here: put a buffer between receiving the request and writing to the actual data store, so the user request sees minimal latency. Then, in a separate process, move the data into its final store. Kafka is a good candidate for this (search for modern ingestion architectures for more detail). – Kazaag Mar 05 '15 at 15:23

9 Answers

31

For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real-time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. As long as there is RAM available on the server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk for persistence; that data is read back when the server restarts or after a crash), and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.

In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example, we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure; a link click may have a user token, link name and URL, while a page view may only have the user token and URL. Your first concern is just recording the fact that it happened and pushing whatever absolutely necessary data you need.

Next we have some simple processing workers that take this frantically inserted information off of Redis' hands by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage location. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.).
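
As a rough illustration of this two-stage pattern, here is a minimal Python sketch using redis-py and pymongo. The list name, event fields, and database/collection names are assumptions made up for the example, not anything from the question:

```python
import json
import redis
import pymongo

r = redis.Redis(host="localhost", port=6379)

# Stage one: the request handler pushes the raw event onto a Redis list (O(1)).
def record_click(user_token, link_name, url):
    event = {"type": "link_clicked", "user": user_token,
             "link": link_name, "url": url}
    r.rpush("events:link_clicked", json.dumps(event))  # returns immediately

# Stage two: a worker drains the list and files events into permanent storage.
def run_worker():
    mongo = pymongo.MongoClient()
    events = mongo.tracking.events  # hypothetical database/collection names

    while True:
        # BLPOP blocks until an item is available, then pops it off the list.
        _, raw = r.blpop("events:link_clicked")
        event = json.loads(raw)
        # ...deduplication / ID lookups would go here...
        events.insert_one(event)
```

Run as many copies of the worker as needed to keep the Redis lists short.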

MongoDB is good at document storage. Unlike Redis, it can deal with databases larger than RAM, and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't need a predetermined schema; you're free to change the way data is stored however you want at any time.

I would, however, suggest Redis or Mongo for the "stage one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to ask "Show me everyone who viewed this page", "How many pages did this person view on a given day?" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes that you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
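
As a hedged sketch of what those analytic queries might look like against post-processed data, here is a small example using Python's built-in sqlite3 purely for illustration; the pageviews table and its columns are assumptions, not something from the question:

```python
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pageviews (
                    user_token TEXT, url TEXT, viewed_at DATE)""")

# "Show me everyone who viewed this page"
viewers = conn.execute(
    "SELECT DISTINCT user_token FROM pageviews WHERE url = ?",
    ("/landing",)).fetchall()

# "How many pages did this person view on a given day?"
count = conn.execute(
    "SELECT COUNT(*) FROM pageviews WHERE user_token = ? AND viewed_at = ?",
    ("user-123", "2010-06-10")).fetchone()[0]

# "What day had the most viewers in total?"
busiest = conn.execute(
    """SELECT viewed_at, COUNT(DISTINCT user_token) AS viewers
       FROM pageviews GROUP BY viewed_at
       ORDER BY viewers DESC LIMIT 1""").fetchone()
```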

Skrylar
  • 762
  • 6
  • 16
22

I currently work for a very large ad network and we write to flat files :)

I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).

If you're looking for blazing-fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a really compelling reason to move.
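
For what it's worth, a minimal sketch of that keep-in-memory-and-flush approach in Python, assuming an append-only log file and a one-minute flush interval (all names here are made up for the example):

```python
import json
import threading
import time

_buffer = []
_lock = threading.Lock()

def record_impression(event):
    """Called from the request path; just appends to memory."""
    with _lock:
        _buffer.append(event)

def _flush_loop(path="impressions.log", interval=60):
    """Background thread: every `interval` seconds, append buffered events to disk."""
    while True:
        time.sleep(interval)
        with _lock:
            pending = _buffer[:]
            _buffer.clear()
        if pending:
            with open(path, "a") as f:
                for event in pending:
                    f.write(json.dumps(event) + "\n")

threading.Thread(target=_flush_loop, daemon=True).start()
```

The trade-off, of course, is that anything still sitting in the buffer is lost if the process dies before the next flush.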

Gates VP
  • 44,957
  • 11
  • 105
  • 108
  • Heh, nice! Do you guys also go the impression validation route? Yeah - I was just concerned because while Mongo can handle lots of data, it might be inferior to Redis, and every millisecond counts. – Mark Bao Jun 10 '10 at 08:49
  • 2
    Depends what you mean by "validation". You can't reliably validate impressions/clicks in real time; you have to do that afterwards in larger batches. – Gates VP Jun 10 '10 at 17:28
  • 1
    If you don't mind telling, how do you write to local files? Is that time-based file writers? A file per thread? How do you format the data? Control-character separated? – DarthVader Feb 25 '12 at 08:48
12

All three solutions (four if you count flat files) will give you blazing-fast writes. The non-relational (NoSQL) solutions will also give you tunable fault tolerance for the purposes of disaster recovery.

In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. A peak of 250 operations per second is (or should be) no problem.

The more important question is: what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? Real-time reporting?

MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
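
For instance, a minimal pymongo sketch of that kind of ad-hoc, multi-attribute querying; the collection, field, and index names here are purely illustrative assumptions:

```python
import datetime
import pymongo

client = pymongo.MongoClient()
events = client.tracking.events  # hypothetical database/collection

# Secondary indexes on whichever attributes you expect to filter by.
events.create_index([("url", pymongo.ASCENDING)])
events.create_index([("user_token", pymongo.ASCENDING),
                     ("viewed_at", pymongo.DESCENDING)])

# Ad-hoc query across several attributes, no predefined schema required.
recent = events.find({
    "url": "/landing",
    "viewed_at": {"$gte": datetime.datetime(2010, 6, 1)},
}).sort("viewed_at", -1).limit(100)
```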

Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary index on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attributes, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out of each entry and run it through a client-side regex.
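
To illustrate that query-by-key constraint, here is a rough sketch using the Python cassandra-driver and modern CQL (which postdates this answer, so treat it as an illustration only; the keyspace, table, and column names are assumptions):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("tracking")  # hypothetical keyspace

# The primary key is fixed up front; queries are expected to hit it.
session.execute("""
    CREATE TABLE IF NOT EXISTS clicks (
        user_token text,
        clicked_at timestamp,
        url text,
        PRIMARY KEY (user_token, clicked_at))""")

# Fast: filters on the partition key defined above.
rows = session.execute(
    "SELECT url, clicked_at FROM clicks WHERE user_token = %s", ("user-123",))

# Anything not covered by the key (e.g. "url starts with /landing") has to be
# handled by scanning and filtering outside the database.
```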

I'm not familiar enough with Redis to speak intelligently about it. Sorry.

If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.

Hope this helps.

Data Monk
  • 1,279
  • 9
  • 15
  • Redis is like Cassandra in that regard: a key-value datastore. You can't efficiently query for all first names starting with "Alex" unless you build those keys in advance. – atp Aug 02 '11 at 23:04
  • @ash That is not quite right, at this time. Cassandra is a column store, essentially mapping a key to a map. You can think of it as a key-value store where each value is itself a map. It is extremely efficient to perform a column-range query, since the row is stored on a single machine, while row-range queries are more expensive. I.e., searching for (some_row_key)=>(name):data provides extremely fast wildcards on name for a given row. Each row, however, may be sharded onto a different box. – Cory Dolphin Jul 15 '13 at 22:07
  • 1
    Also, Mongo will take a lock on the whole database when writing. – Cory Dolphin Jul 15 '13 at 22:10
9

Just found this: http://blog.axant.it/archives/236

Quoting the most interesting part:

This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becomes incredibly slow, probably because the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performance by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.

I guess everything depends at least on data type and volume. The best advice probably would be to benchmark on your typical dataset and see for yourself.
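
If you do want to run that kind of benchmark yourself, here is a rough Python sketch comparing Redis RPUSH against MongoDB's $push on a single document. The list/collection names and iteration count are arbitrary assumptions, and real numbers will depend heavily on your hardware and drivers:

```python
import json
import time
import redis
import pymongo

N = 5000
payload = {"user": "u1", "url": "/landing"}

# Redis: push N items onto the end of a list.
r = redis.Redis()
start = time.time()
for _ in range(N):
    r.rpush("bench:list", json.dumps(payload))
print("Redis RPUSH:", time.time() - start, "seconds")

# MongoDB: $push N items onto an array inside one document.
coll = pymongo.MongoClient().bench.pushes
coll.delete_many({})
coll.insert_one({"_id": 1, "items": []})
start = time.time()
for _ in range(N):
    coll.update_one({"_id": 1}, {"$push": {"items": payload}})
print("Mongo $push:", time.time() - start, "seconds")
```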

drdaeman
  • 11,159
  • 7
  • 59
  • 104
6

According to the Benchmarking Top NoSQL Databases (download here) I recommend Cassandra.

Phat H. VU
  • 2,350
  • 1
  • 21
  • 30
  • 8
    For disclaimer purposes, the company that produced this benchmark was commissioned by Datastax, which is a vendor of Cassandra related products and services. – kewne Jul 19 '17 at 11:13
  • 1
    The benchmark link does not exist anymore; please update it. – Ali Bigdeli Nov 01 '21 at 07:57
3

If you have the choice (and need to move away from flat files), I would go with Redis. It's blazingly fast, will comfortably handle the load you're talking about, but more importantly, you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.

You will also get horizontal scaling options with Redis that you may not get with file based caching.

Ben Hughes
  • 2,527
  • 1
  • 18
  • 16
3

I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.

EhevuTov
  • 20,205
  • 16
  • 66
  • 71
  • 2
    How does Node.js help with making things more asynchronous? I think there are dozens of other (some probably even better) ways of making things asynchronous without using Node – eddyP23 Jul 19 '17 at 08:32
2

The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.

Flat files are good. Summary statistics (e.g. total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
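
As a toy, in-memory version of that roll-your-own aggregation, assuming one JSON event per line in an append-only log (the file name and fields are made up), see the sketch below; a truly scalable version would sort and merge the files in chunks as described above:

```python
import json
from collections import Counter

hits = Counter()
with open("impressions.log") as f:          # hypothetical append-only log
    for line in f:
        event = json.loads(line)
        hits[event["url"]] += 1             # total hits per page

for url, count in hits.most_common(10):
    print(url, count)
```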

SQLite now supports Write Ahead Logging, which may also provide adequate performance.
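
For completeness, a small sketch of enabling WAL mode and batching inserts with Python's built-in sqlite3 (the table name and columns are assumptions):

```python
import sqlite3

conn = sqlite3.connect("impressions.db")
conn.execute("PRAGMA journal_mode=WAL")     # appends go to the sequential WAL file
conn.execute("""CREATE TABLE IF NOT EXISTS impressions
                (user_token TEXT, url TEXT, ts TEXT)""")

# Batch the writes so many inserts share one transaction / fsync.
batch = [("user-1", "/landing", "2010-06-10T08:00:00")] * 100
with conn:
    conn.executemany("INSERT INTO impressions VALUES (?, ?, ?)", batch)
```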

Paul Harrison
  • 455
  • 3
  • 6
-8

I have hands-on experience with MongoDB, CouchDB and Cassandra. I converted a lot of files to base64 strings and inserted these strings into NoSQL.
MongoDB was the fastest. Cassandra was the slowest. CouchDB was slow too.

I think MySQL would be much faster than all of them, but I haven't tried MySQL for my test case yet.

Peter Long
  • 3,964
  • 2
  • 22
  • 18
  • 1
    MySQL is generally much slower than Mongo – ADAM Aug 30 '11 at 00:10
  • 30
    your credibility vaporized with that last sentence – William Oct 20 '11 at 21:44
  • 4
    These claims mean little without numbers (how many inserts? How large), and without describing how you did the measurements (e.g. you can't max out a Cassandra node with a single client - it takes multiple clients) – DNA Nov 24 '11 at 14:19