
Let's say I have 100,000 documents from different customer groups, all formatted the same way and containing the same type of information.

Documents from individual customer groups are refreshed at different times of the day. It has been recommended that I give each customer group its own index, so that when a customer's data is refreshed locally I can create a new index for that customer and delete the old one.
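The rebuild-and-swap I have in mind would use the `_aliases` endpoint, which applies its actions atomically so searches never hit a window with no index. A rough sketch (the names `customer_a`, `customer_a_v1`, and `customer_a_v2` are just placeholders):

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    # Body for POST /_aliases: the remove and add are applied as one
    # atomic operation, so queries against `alias` see either the old
    # index or the new one, never neither.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# After rebuilding customer A's data into customer_a_v2:
body = alias_swap_actions("customer_a", "customer_a_v1", "customer_a_v2")
print(json.dumps(body))
# Sent with e.g.: curl -XPOST localhost:9200/_aliases -d '...'
```

Once the swap succeeds, the old index (`customer_a_v1` here) can be deleted.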

What are the implications for splitting the data into multiple indexes and querying using an alias? Specifically:

  • Will it increase my server HDD requirements?
  • Will it increase my server RAM requirements?
  • Will Elasticsearch be slower to search when querying all the indexes through the alias?

Thank you for any help or advice.

Jimmy
    How many indices in the end? – Andrei Stefan Apr 07 '15 at 20:21
  • @AndreiStefan Thank you for the comment. It's hard to say. To start with, about 10, but in the future it may increase significantly. – Jimmy Apr 07 '15 at 20:33
    The idea is that each node can hold a certain number of shards, and depending on how you use those indices (how often, indexing vs. searching, how frequently) the maximum number of shards a node can hold varies. Also in play are how many shards each index is configured to have and how many replicas. That was the reason for my initial question. If by "refresh" you mean changing all the documents of that index, then I think it would indeed be more efficient to build a new index. But keep in mind the number of shards. – Andrei Stefan Apr 07 '15 at 20:38
  • @AndreiStefan What happens in a refresh is: say there are 100,000 items; about 10k of those will no longer appear (i.e. items in the index need removing because they are no longer in the new data source) and another 10k are new entries. Do you think this still lends itself to a full index rebuild, or is it sensible to work with one index and just update it? Sorry for the follow-up question. – Jimmy Apr 07 '15 at 20:42
    How many nodes in the cluster? And what are the predictions for number of docs increase and number of customers? – Andrei Stefan Apr 07 '15 at 20:44
  • @AndreiStefan Currently one in the cluster (one elasticsearch default install on an 8gb RAM server) – Jimmy Apr 07 '15 at 20:46
    If the number of customers will increase significantly and you have no plans in adding more nodes, then stick with one index only. – Andrei Stefan Apr 07 '15 at 20:52
  • possible duplicate of [In ElasticSearch, should I use multiple indexes for separate but related entities?](http://stackoverflow.com/questions/14221159/in-elasticsearch-should-i-use-multiple-indexes-for-separate-but-related-entitie) – von v. Aug 12 '15 at 22:47

1 Answer


Every index has some overhead at every level, but it's usually small. For 100,000 documents I would question the need for splitting unless the documents are very large. In general, each added index will:

  1. Require some amount of RAM for insert buffers and other per-index bookkeeping

  2. Have its own merge overhead on disk relative to a single larger index

  3. Add some latency at query time due to result merging when a query spans multiple indexes

There are a lot of factors that go into determining if any of these are significant. If you have lots of RAM and several CPUs and SSDs then you may be fine.

I would advise you to build a solution that uses as few shards as possible. That probably means a single index (or at most a few).
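If you do consolidate into one index, each document can carry a customer identifier and every query can filter on it, so one customer's refresh only touches that customer's documents. A rough sketch of such a query body (the `customer_id` and `body` field names are placeholders, not from your mapping):

```python
import json

def customer_query(customer_id, search_text):
    # Restrict a full-text search to a single customer's documents in
    # a shared index by combining it with a term filter. Filters are
    # cached and don't affect scoring.
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": search_text}}],
                "filter": [{"term": {"customer_id": customer_id}}],
            }
        }
    }

print(json.dumps(customer_query("acme", "quarterly invoice")))
```

With this layout, a refresh becomes a bulk delete of that customer's stale documents plus a bulk index of the new ones, rather than an index swap.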

Andrew White