
I am running ES 1.7 in production with the `_source` field disabled, so we cannot use features such as highlighting. When I enabled `_source` on ES 1.7, the index size increased by 50%.

We are now evaluating ES 7.x. When I repeated the same exercise there, the total index size increased by 300%, which is very surprising and raised our cost estimates considerably.

I am not sure which figure is reliable. I know the base index size depends on the type of documents and the analyzers used (for example, removing stop words and stemming reduce the index size), but all my data is text, and the same index type and schema are used in both versions of ES. Why is there such a large difference in storage with and without `_source`?

Also, how much is the index size normally expected to grow when `_source` is enabled on indices that contain text?
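For reference, the percentages above are relative growth figures. A minimal sketch of the calculation follows, assuming the sizes are taken from the primary shards only (the `primaries.store.size_in_bytes` value in the index stats response) so that replica counts do not skew the comparison. The helper names and the byte counts are hypothetical and only illustrate the arithmetic.

```python
def primary_store_bytes(stats: dict, index: str) -> int:
    """Pull the primary-shard store size out of an index stats response.

    Assumes `stats` is the parsed JSON of `GET /<index>/_stats/store`.
    """
    return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]


def source_growth_percent(size_without_source: int, size_with_source: int) -> float:
    """Relative growth of the index, in percent, after enabling _source."""
    if size_without_source <= 0:
        raise ValueError("baseline size must be positive")
    return (size_with_source - size_without_source) / size_without_source * 100.0


# Made-up byte counts, chosen only to mirror the 50% and 300% figures.
growth_1_7 = source_growth_percent(100, 150)
growth_7_x = source_growth_percent(100, 400)
```

Comparing primaries only, with the same number of shards on both clusters, keeps the two growth figures comparable.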

Amit
  • It obviously depends on the size of your source. Can you share your mapping? – Val Oct 15 '19 at 06:10
  • @Val, yes, but the same fields and analyzers are used in ES 1.7 and ES 7.x, and all fields are either text or numbers. It's company data, so I can't share the exact mapping file, but I guess you get the idea. – Amit Oct 15 '19 at 06:13
  • There are several years of engineering, new features, and bug fixes between 1.7 and 7.x. So many differences that it's hard to tell what might be the reason without having access to more information. `text` is probably the culprit... there are many things you can configure around that data type. Many data structures have been added in the index at the Lucene level between 1.7 and 7.x. I won't venture into more predictions, because that would look more like wizardry than anything else ;-) – Val Oct 15 '19 at 06:26
  • @Val haha, I agree that there are several years of effort, but then storage should be getting more optimized rather than worse. I don't see major changes in how text is stored, and I am not customizing this field much. The problem is that we have a huge amount of data, and this triples our disk cost, which is a concern for us. – Amit Oct 15 '19 at 06:30
  • What you call "getting worse" is relative... There are simply more data structures being stored by default (e.g. doc_values since 2.x) in order to facilitate downstream tasks... Have you tried to increase the [compression level](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-codec) for instance? – Val Oct 15 '19 at 06:36
  • Also you might not need to enable the _source but simply store the fields you want to highlight on... – Val Oct 15 '19 at 06:39
  • @Val, but I also need the in-place reindexing feature, which I guess would not work without `_source`. I know about the compression level, but I want to use the default settings for capacity planning, and our current installation doesn't use a custom compression level either. – Amit Oct 15 '19 at 07:04
  • Ok, you didn't mention reindexing in your question, only "etc" :-) Are you sure you're comparing apples to apples with respect to the number of shards, replicas, nodes, etc.? – Val Oct 15 '19 at 07:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/200885/discussion-between-amit-khandelwal-and-val). – Amit Oct 15 '19 at 08:04
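The two suggestions from the comments (a higher compression level, and storing only the fields to highlight instead of enabling `_source`) could be combined in an index-creation body along these lines. This is a rough sketch for ES 7.x typeless mappings; the field names `title` and `body` and the helper `build_index_body` are hypothetical, and as noted in the thread, leaving `_source` disabled still rules out in-place reindexing.

```python
import json


def build_index_body(highlight_fields, codec="best_compression"):
    """Build a create-index request body as a plain dict.

    - `index.codec` selects the stored-field compression
      ("best_compression" trades some speed for smaller stored fields).
    - `_source` stays disabled; each field to highlight is stored individually.
    """
    properties = {
        name: {"type": "text", "store": True}  # stored so highlighting can read it
        for name in highlight_fields
    }
    return {
        "settings": {"index": {"codec": codec}},
        "mappings": {
            "_source": {"enabled": False},
            "properties": properties,
        },
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    print(json.dumps(build_index_body(["title", "body"]), indent=2))
```

The printed JSON is what would be sent as the body of a `PUT /<index>` request; whether the size savings are worth losing reindexing depends on the use case.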
