Regarding performance, provided you use appropriately sized hardware, you should not have issues indexing 1M documents per hour. I've run Elasticsearch well above that rate with no issues. There is a detailed writeup here that you may find useful concerning benchmarking and sizing a large Elasticsearch cluster:
ElasticSearch setup for a large cluster with heavy aggregations
For an ephemeral caching system with a TTL of only 3 hours, I agree it would be a waste to store the data in more than one repository. You could store the data in Couchbase and replicate it into Elasticsearch in real time or near real time, but why bother with that? I'm not certain what benefit you would get from having the data in both places.
For performance issues concerning your specific use case I'd strongly suggest benchmarking. One strength of Elasticsearch (and Solr too) that I've found is their (to me) surprisingly strong performance when searching on multiple, non-text fields. You tend to think of ES for text search purposes (where it does excel), but it's also a solid general-purpose database. In particular I've found it performs well when searching on multiple parameters compared to some other NoSQL solutions.
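As a rough illustration, here's what such a multi-field filter might look like using the Python client. The index name, field names, and values are all made up for the example, and the exact query syntax varies across ES versions (this uses the bool/filter form):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Filter on several non-text fields at once. The index, field names
# and values are all hypothetical placeholders.
results = es.search(
    index="cache",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user_id": 12345}},
                    {"term": {"region": "us-east"}},
                    {"range": {"created_at": {"gte": "now-3h"}}},
                ]
            }
        }
    },
)
print(results["hits"]["total"])
```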
Personally, when benchmarking ES for this use case, I'd look at a number of different indexing options. ES supports TTLs for documents, so automatically purging the cache is easy:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
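As a sketch of what the mapping side might look like with the Python client, assuming an older (pre-5.0) cluster, since the _ttl field was later deprecated and removed, and with placeholder index/type names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with _ttl enabled and a 3 hour default, so every
# document expires automatically. "cache" and "entry" are placeholder
# names; _ttl was deprecated in ES 2.x and removed in 5.0.
es.indices.create(
    index="cache",
    body={
        "mappings": {
            "entry": {
                "_ttl": {"enabled": True, "default": "3h"}
            }
        }
    },
)

# Documents now inherit the 3h default; a per-document override can
# be passed as the ttl parameter on the index call.
es.index(index="cache", doc_type="entry", id="some-key",
         body={"value": "cached payload"}, ttl="1h")
```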
However, you may want to play around with having a different index for each hour. One thing about ES (due to its use of Lucene underneath for indexing and file storage) is that deletes work differently than in most databases. Documents are marked as deleted but not removed, and then periodically the underlying files (called segments) are merged, at which point new segments are created without the deleted documents. This can cause a fair amount of disk activity in high-volume, delete-heavy use cases on a single index. The way around this is to create a new index for each hour and then delete each index in its entirety once the data in it is over 3 hours old, as sketched below.
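A minimal sketch of the hourly-index approach, again with a made-up index naming scheme:

```python
from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

es = Elasticsearch()

def hourly_index(ts):
    # e.g. "cache-2014061715"; the naming scheme is a placeholder
    return "cache-" + ts.strftime("%Y%m%d%H")

now = datetime.utcnow()

# Writes always go to the current hour's index (created on demand).
es.index(index=hourly_index(now), doc_type="entry",
         body={"value": "cached payload"})

# Reads fan out across all hourly indexes via a wildcard.
es.search(index="cache-*", body={"query": {"match_all": {}}})

# Deleting a whole index is a cheap metadata operation, unlike bulk
# document deletes. At hour N, everything in the hour N-4 index is
# at least 3 hours old, so it can be dropped wholesale.
expired = hourly_index(now - timedelta(hours=4))
if es.indices.exists(index=expired):
    es.indices.delete(index=expired)
```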
You may find this previous discussion about TTL vs. time series indexes in Elasticsearch useful: Performance issues using Elasticsearch as a time window storage
Finally, regarding easy horizontal scaling, Elasticsearch is pretty good here - you add a new node with the correct cluster name and ES takes care of the rest, automatically migrating shards to the new node. In your use case you may also want to play with the replication factor, as adding replicas across more nodes is an easy way to boost query performance.
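For example, with the Python client you can bump the replica count on a live index (the index name is again a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Replica count can be changed on a live index; each extra replica
# gives the cluster another copy to serve reads from. "cache" is a
# placeholder index name.
es.indices.put_settings(
    index="cache",
    body={"index": {"number_of_replicas": 2}},
)
```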