We have ~1Tb of user profiles and need to perform two types operations on them:
- random reads and writes (~20k profile updates per second)
- queries on predefined dimensions (e.g. for reporting)
For example, if we encounter user in a transaction, we want to update his profile with a URL he came from. At the end of the day we want to see all users who visited particular URL. We don't need joins, aggregations, etc., only filtering by one or several fields.
We don't really care about latency, but need high throughput.
Most databases we looked at belong to one of two categories - key-value DBs with fast random access or batch DBs optimized for querying and analytics.
Key-value storages
Aerospike can store terabyte-scale data and is very well-optimized for fast key-based lookup. However, queries on secondary index are deadly slow, which makes it unsuitable for our purposes.
MongoDB is pretty flexible, but requires too much hardware to handle our load. In addition, we encountered particular issues with massive exports from it.
HBase looks attractive since we already have Hadoop cluster. Yet, it's not really clear how to create secondary index for it and what its performance will be.
Cassandra - may be an option, but we don't have experience with it (if you do, please share it)
Couchbase - may be an option, but we don't have experience with it (if you do, please share it)
Analytic storages
Relational DBMS (e.g. Oracle, PostreSQL) provide both - random access and efficient queries, but we have doubts that they can handle terabyte data.
HDFS / Hive / SparkSQL - excellent for batch processing, but doesn't support indexing. The closest thing is partitioning, but it's not applicable given many-to-many relations (e.g. many users visited many URLs). Also, to our knowledge none of HDFS-backed tools except for HBase support updates, so you can only append new data and read latest version, which is not very convenient.
Vertica has very efficient queries, but updates boil down to rewriting the whole file, so are terribly slow.
(Because of limited experience some of information above may be subjective or wrong, please feel free to comment about it)
- Do any of the mentioned databases have useful options that we missed?
- Is there any other database(s) optimized for your use case? If not, how would you address this task?