
We have ~1 TB of user profiles and need to perform two types of operations on them:

  • random reads and writes (~20k profile updates per second)
  • queries on predefined dimensions (e.g. for reporting)

For example, when we encounter a user in a transaction, we want to update their profile with the URL they came from. At the end of the day we want to see all users who visited a particular URL. We don't need joins, aggregations, etc., only filtering by one or several fields.
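
To make the workload concrete, here is a toy sketch of the two access patterns, with a plain dict standing in for whatever engine we end up with (all names are illustrative):

```python
# In-memory stand-in for the profile store; a real engine replaces STORE.
STORE = {}

def record_visit(user_id, url):
    """Point write (~20k/s in production): append a referrer URL to one profile."""
    profile = STORE.setdefault(user_id, {"visited_urls": []})
    profile["visited_urls"].append(url)

def users_by_url(url):
    """End-of-day query: filter all profiles by a single field."""
    return [uid for uid, p in STORE.items() if url in p["visited_urls"]]

record_visit("u1", "http://example.com/promo")
print(users_by_url("http://example.com/promo"))  # -> ['u1']
```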

We don't really care about latency, but we do need high throughput.


Most databases we looked at fall into one of two categories: key-value stores with fast random access, or batch databases optimized for querying and analytics.

Key-value stores

Aerospike can store terabyte-scale data and is very well optimized for fast key-based lookups. However, queries on secondary indexes are painfully slow, which makes it unsuitable for our purposes.
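
For reference, this is roughly the secondary-index path we mean, sketched with the Aerospike Python client; the namespace, set, and bin names are made up, and creating the index is one-time setup:

```python
import aerospike
from aerospike import predicates as p

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# The fast path: key-based write, which Aerospike is optimized for.
client.put(("test", "profiles", "user-42"),
           {"last_url": "http://example.com/promo"})

# One-time setup; raises an error if the index already exists.
client.index_string_create("test", "profiles", "last_url", "idx_last_url")

# The slow path in our case: a query over the secondary index.
query = client.query("test", "profiles")
query.where(p.equals("last_url", "http://example.com/promo"))
for key, meta, bins in query.results():
    print(bins)
```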

MongoDB is pretty flexible, but requires too much hardware to handle our load. In addition, we ran into issues with massive exports from it.

HBase looks attractive since we already have a Hadoop cluster. Yet it's not really clear how to create a secondary index for it, or what its performance would be (a common workaround is sketched below).
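
The usual workaround we've seen described is a hand-maintained index table updated alongside the data. A minimal sketch with the happybase client (table and column-family names are assumptions, and both tables must already exist):

```python
import happybase

connection = happybase.Connection("hbase-master")  # host is an assumption
profiles = connection.table("profiles")     # user_id -> profile columns
url_index = connection.table("url_index")   # hand-maintained inverted index

def record_visit(user_id, url):
    # HBase has no built-in secondary indexes, so we write the profile
    # update and the index entry ourselves; keeping them consistent is on us.
    profiles.put(user_id.encode(), {b"p:last_url": url.encode()})
    # Assumes '|' does not appear in URLs; a real key scheme needs escaping.
    url_index.put(f"{url}|{user_id}".encode(), {b"i:user": user_id.encode()})

def users_by_url(url):
    # A prefix scan over the index table stands in for an indexed query.
    prefix = f"{url}|".encode()
    return [key.split(b"|", 1)[1] for key, _ in url_index.scan(row_prefix=prefix)]
```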

Cassandra may be an option, but we don't have experience with it (if you do, please share).

Couchbase may be an option, but we don't have experience with it (if you do, please share).

Analytical stores

Relational DBMSs (e.g. Oracle, PostgreSQL) provide both random access and efficient queries, but we have doubts that they can handle terabyte-scale data.

HDFS / Hive / SparkSQL are excellent for batch processing, but don't support indexing. The closest thing is partitioning, which isn't applicable given many-to-many relations (e.g. many users visit many URLs). Also, to our knowledge none of the HDFS-backed tools except HBase support updates, so you can only append new data and read the latest version, which is not very convenient (a sketch of that pattern follows).
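
For illustration, the append-and-read-latest pattern looks roughly like this in SparkSQL (paths and column names are placeholders):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profiles").getOrCreate()

# Append-only log of profile updates; no in-place updates on HDFS.
updates = spark.read.parquet("hdfs:///profiles/updates/")

# "Updating" means keeping every version and picking the newest per user.
w = Window.partitionBy("user_id").orderBy(F.col("updated_at").desc())
latest = (updates
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# The reporting query is a full scan with a filter; no index helps here.
(latest
 .filter(F.array_contains("visited_urls", "http://example.com/promo"))
 .select("user_id")
 .show())
```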

Vertica has very efficient queries, but updates boil down to rewriting a whole file, so they are terribly slow.

(Because of our limited experience, some of the information above may be subjective or wrong; please feel free to comment on it.)


  1. Do any of the mentioned databases have useful options that we missed?
  2. Are there any other databases optimized for our use case? If not, how would you address this task?
  • We tried Couchbase. It was great until our total data outgrew memory capacity; I've heard they have support for SSD persistence nowadays, though. You kind of need the best of both worlds. Have you thought about using a K/V store that asynchronously feeds an analytical store with only the query-relevant data? Analytics could happen in memory if you have everything persisted in KV and can re-create it in case you really need to. Async updates + in-memory could help a lot with throughput on the analytical store. – Manuel Arwed Schmidt Jan 23 '16 at 00:42
  • @ManuelArwedSchmidt: I see that Couchbase supports [persistence to disk](http://docs.couchbase.com/admin/admin/Concepts/concept-diskStorage.html); from your words I suppose it requires all the data to fit in memory as well? Yes, we thought about double storage, but in this case you either 1) send only profile updates to the analytical storage, which is inconvenient to analyze, or 2) send full profiles, thus duplicating data many times, or 3) query the KV storage, which is what we are looking for anyway. – ffriend Jan 24 '16 at 22:52
  • CB is bottlenecked on random disk reads if the dataset outgrows memory (back then we used HDDs...), but it doesn't require all data in memory: a perfect match for data sets with cold data. You'd need 'full' profiles in analytics; by full I mean all columns you want to query on. I'd keep complete profiles in KV and do a batch read after analytics returns me a set of ids for page 1 / 20913 after filtering, aggregating and sorting on the 'indexed columns'. Get the idea? It's about moving as much work as possible away from analytics; this e.g. reduces row size, which is a big deal with [..] – Manuel Arwed Schmidt Jan 25 '16 at 00:08
  • [...] row-based SQL DBs, and removing the need for persistence (O_SYNC is a major bottleneck!) can also help push throughput on the analytic DB at least an order of magnitude beyond what it would be capable of if it had to provide persistence. Anyway, this question is very interesting. I don't know why it didn't catch a lot of attention; it would have deserved that. – Manuel Arwed Schmidt Jan 25 '16 at 00:11
  • So it sounds like an external index for the KV storage (sketched after these comments). We need to test performance, but all in all it sounds pretty reasonable. Thanks! – ffriend Jan 25 '16 at 11:55
  • Yes, external index :) It'd be interesting to hear what performance you achieved at a data set size of x (size in bytes & rowcount & amount of indexes). We are working on something similar but couldn't benchmark yet. – Manuel Arwed Schmidt Jan 26 '16 at 12:31
  • FoundationDB might be a good candidate? – amirouche Jan 13 '20 at 18:42
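
To make the "external index" idea from the comments concrete, here is a toy stand-in: sqlite3 plays the analytical store holding only ids plus the indexed columns, and a dict plays the K/V store holding full profiles (all names illustrative):

```python
import sqlite3

# K/V store stand-in: full profiles live here, keyed by user id.
kv = {"u1": {"name": "Alice", "visited_urls": ["http://example.com/promo"]}}

# Analytical store stand-in: only ids and the query-relevant column.
ix = sqlite3.connect(":memory:")
ix.execute("CREATE TABLE profile_index (user_id TEXT, last_url TEXT)")
ix.execute("CREATE INDEX idx_url ON profile_index (last_url)")
ix.execute("INSERT INTO profile_index VALUES ('u1', 'http://example.com/promo')")

def users_by_url(url):
    # 1. Filter on the slim indexed table; only ids come back.
    ids = [row[0] for row in ix.execute(
        "SELECT user_id FROM profile_index WHERE last_url = ?", (url,))]
    # 2. Batch read of the full profiles from the K/V store.
    return [kv[i] for i in ids]

print(users_by_url("http://example.com/promo"))  # -> [{'name': 'Alice', ...}]
```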

0 Answers