
I recently started using Elasticsearch 2, and as I understand analyzed vs. not_analyzed in the mapping, not_analyzed should use less storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story). For testing purposes I created some indexes with all the string fields analyzed (the default), and then I created some other indexes with all the fields not_analyzed. My surprise came when I checked the size of the indexes and saw that the indexes with the not_analyzed strings were 40% bigger! I was inserting the same documents into each index (35,000 docs).

Any idea why this is happening? My documents are simple JSON documents. I have 60 string fields in each document that I want to set as not_analyzed, and I tried both setting each field individually as not_analyzed and creating a dynamic template (a rough sketch of the template is shown after the mapping below).

Edit: I'm adding the mapping, although I think it has nothing special:

    {
        "mappings": {
            "my_type": {
                "_ttl": { "enabled": true, "default": "7d" },
                "properties": {
                    "field1": {
                        "properties": {
                            "field2": {
                                "type": "string", "index": "not_analyzed"
                            },
                            ... more not_analyzed string fields here ...
                        }
                    },
                    ...
                }
            }
        }
    }
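
The dynamic template variant I tried looked roughly like this (a sketch; the template name is just illustrative):

    {
        "mappings": {
            "my_type": {
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed"
                            }
                        }
                    }
                ]
            }
        }
    }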
ibi2886
  • Can you post the full mapping definition? There may be other properties contributing to this. For example, are you setting the store property? In some instances that can cause duplicate data to be added to the index. – rlcrews Dec 21 '15 at 21:58
  • I added part of the mapping; the rest is simply more fields. I didn't add the store property or anything else. – ibi2886 Dec 21 '15 at 22:18
  • To elaborate on both of our answers, the main thing `not_analyzed` is used for is things like aggregations. If you have (for example) a category field, you want to be able to sort/filter by that category but may not care if someone can really search it. That means you want it not to be analyzed into a set of tokens for searching, but saved as the original input string. – Sam Dec 21 '15 at 22:29

3 Answers


not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis", in Lucene parlance).

As an example:

(Doc 1) "The quick brown fox jumped over the lazy dog"

(Doc 2) "Lazy like the fox"


  1. Simplified postings list created by the Standard Analyzer (the default for analyzed string fields; input is tokenized, lowercased, and, assuming a stopword filter is configured, stopwords removed):
"brown": [1]  
"dog": [1]  
"fox": [1,2]  
"jumped": [1]  
"lazy": [1,2]  
"over": [1] 
"quick": [1]

30 characters worth of string data


  2. Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]  
"Lazy like the fox": [2] 

62 characters worth of string data


Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.

But as a result, the unit of text is reduced to a normalized term (versus an entire field value with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list, saving you all the space that would normally be consumed by repeated terms and stopwords.
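
If you want to see exactly which terms analysis produces for a given input, the `_analyze` API returns the tokens. A minimal sketch (note that, unlike Lucene's StandardAnalyzer, Elasticsearch's standard analyzer only removes stopwords if you configure it with a stopword list):

    GET /_analyze
    {
        "analyzer": "standard",
        "text": "The quick brown fox jumped over the lazy dog"
    }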

Peter Dixon-Moses
  • Thank you for getting into the nitty-gritty of what an analyzed/raw set of data looks like. I wasn't sure of that for my answer :) – Sam Dec 21 '15 at 22:28
  • I knew how it works and I thought about this, but I was confused by the fact that the official documentation always shows not_analyzed as better than analyzed, in both posts. Your explanation makes me think it is in fact the other way around: analyzed will normally save more space. – ibi2886 Dec 21 '15 at 23:03
  • Analysis is a general umbrella term under which you can stuff any number of text transformations. In the example I gave, the analysis caused a net reduction in text stored. But there are many cases where the analyzed volume of the input text can be much larger (ngrams, shingles, synonym expansion, phonetic expansion, word-delimiter expansion). It really depends what you're doing in the analysis step. It wasn't clear from Peter Kim's blog post what sort of transformations were happening and on which fields, or whether the net result *should* be expansive vs. reductive. – Peter Dixon-Moses Dec 22 '15 at 01:42
  • Also, analysis generates/adds to other data structures (e.g. term positions/offsets, skip lists, etc.) that wouldn't otherwise have to exist, so you can't really compare index size for `analyzed` vs `not_analyzed` across the board. Also worth mentioning: if you don't need a field for matching at all, you can use `index: no` in place of `not_analyzed` to keep the data out of the postings list entirely (see the sketch after these comments). – Peter Dixon-Moses Dec 22 '15 at 14:10
  • Ok. Understood. Thanks for the answers! – ibi2886 Dec 22 '15 at 19:39
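
For reference, a minimal sketch of the `index: no` option mentioned in the comments (the field name is made up); such a field is still returned from `_source` but cannot be searched:

    {
        "mappings": {
            "my_type": {
                "properties": {
                    "display_only_field": {
                        "type": "string",
                        "index": "no"
                    }
                }
            }
        }
    }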

From the documentation, it looks like not_analyzed makes the field act like a "keyword" instead of a "full-text" field -- let's compare these two!

Full text

These fields are analyzed, that is, they are passed through an analyzer to convert the string into a list of individual terms before being indexed.

Keyword

Keyword fields are not_analyzed. Instead, the exact string value is added to the index as a single term.

I'm not surprised that storing an entire string as a term, rather than breaking it into a list of terms, doesn't necessarily translate to saved space. Honestly, it probably depends on the index's analyzer and the string being indexed.
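
If you need both behaviors on the same field, a common pattern is a multi-field mapping that indexes the string both ways (a sketch only; the field names are made up):

    {
        "mappings": {
            "my_type": {
                "properties": {
                    "category": {
                        "type": "string",
                        "fields": {
                            "raw": {
                                "type": "string",
                                "index": "not_analyzed"
                            }
                        }
                    }
                }
            }
        }
    }

Here `category` supports full-text search while `category.raw` can be used for sorting and aggregations -- note this indexes the value both ways, so it costs more space, not less.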


As a side note, I just re-indexed about a million documents of production data and cut our index disk space usage by ~95%. The main difference was changing what was actually saved in the source (i.e. stored). We indexed PDFs for searching but did not need them to be returned, so we stopped saving that information in two different ways (analyzed and raw). There are some very real downsides to this, though, so be careful!
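
The change was along these lines: excluding the bulky field from `_source` so the raw text isn't stored alongside the indexed terms (a sketch; `pdf_content` is a hypothetical field name):

    {
        "mappings": {
            "my_type": {
                "_source": {
                    "excludes": ["pdf_content"]
                }
            }
        }
    }

The downside is that Elasticsearch can no longer return (or re-index from) the original value of that field, which is why I say be careful.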

Sam

Doc 1: { "name": "my name is mayank kumar" }

Doc 2: { "name": "mayank" }

Doc 3: { "name": "Mayank" }

We have 3 documents.

So if the field 'name' is 'not_analyzed' and we search for 'mayank', only the second document would be returned. If we search for 'Mayank', only the third document would be returned.

If the field 'name' is 'analyzed' by a lowercase analyzer (just as an example) and we search for 'mayank', all 3 documents would be returned. If we search for 'kumar', the first document would be returned. This happens because in the first document the field value gets tokenized as "my" "name" "is" "mayank" "kumar".

'not_analyzed' is basically used for exact matching rather than full-text search (though wildcards still run against the whole value). It takes less space on disk and less time during indexing.

'analyzed' is basically used for full-text matching of documents. It takes more space on disk (if the analyzed fields are big) and more time during indexing (more terms due to analysis).
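
For example (the index name is made up), a term query against a not_analyzed field matches only the exact value, while a match query against an analyzed field matches individual tokens:

    GET /my_index/_search
    {
        "query": { "term": { "name": "mayank" } }
    }

    GET /my_index/_search
    {
        "query": { "match": { "name": "kumar" } }
    }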

wonder