analyzed or not_analyzed, what to choose

Question

I'm using only Kibana to search Elasticsearch, and I have several fields that can only take a few values (worst case, servername, with 30 different values).

I do understand what analysis does to bigger, more complex fields, but for the small and simple ones I fail to understand the advantages and disadvantages of analyzed vs. not_analyzed fields.

So what are the benefits of using analyzed and not_analyzed for a "limited set of values" field (for example, servername: server[0-9]*, with no special characters to break on)? What kinds of searches will I lose in Kibana? Will I gain any search speed or disk space?

Testing on one of them, I saw that the .raw version of the field is now empty, but Kibana still flags the field as analyzed, so I find my tests inconclusive.

score 34 · Accepted Answer · edited Sep 08 '17 at 02:45

I will try to keep it simple; if you need more clarification, just let me know and I'll elaborate.

An "analyzed" field is broken into tokens using the analyzer that you defined for that specific field in your mapping. If you are using the default analyzer (the standard analyzer, which basically splits text on word boundaries and lowercases each token) and your values have no special characters (let's say server[0-9]*), it is going to tokenize:

this -> HelloWorld123
into -> token1:helloworld123

OR

this -> Hello World 123
into -> token1:hello && token2:world && token3:123

In this case, if you search for HeLlO, the query becomes "hello" and it will match this document, because the token "hello" is there.

In the case of not_analyzed fields, no tokenizer is applied at all; the whole value is stored as a single token, so:

this -> Hello World 123
into -> token1:(Hello World 123)

If you search that field for "hello world 123", it is not going to match, because the comparison is case sensitive (you can still use wildcards, e.g. Hello*; let's address that another time).
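
The difference above can be sketched in a few lines of Python. This is only a rough stand-in for what Elasticsearch does, not its actual implementation: the `analyze` and `keyword` helpers are hypothetical names, and the regex split is an approximation of the default analyzer that holds for plain ASCII values like the ones in this answer.

```python
import re

def analyze(text):
    """Approximate the default analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def keyword(text):
    """not_analyzed: the whole value is kept as one unchanged token."""
    return [text]

def matches_analyzed(query, value):
    # The query goes through the same analyzer, so case differences disappear.
    # Assumption for this sketch: any shared token counts as a match.
    return any(t in analyze(value) for t in analyze(query))

def matches_not_analyzed(query, value):
    # Exact, case-sensitive comparison against the single stored token.
    return query in keyword(value)

print(analyze("HelloWorld123"))    # ['helloworld123']
print(analyze("Hello World 123"))  # ['hello', 'world', '123']
print(matches_analyzed("HeLlO", "Hello World 123"))                # True
print(matches_not_analyzed("hello world 123", "Hello World 123"))  # False
print(matches_not_analyzed("Hello World 123", "Hello World 123"))  # True
```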

In a nutshell:

Use "analyzed" fields for text that you are going to search and that you want Elasticsearch to score. Example: titles that contain the word "jobs", with the query "title:jobs".

doc1 : title:developer jobs in montreal
doc2 : title:java coder jobs in vancouver
doc3 : title:unix designer jobs in toronto
doc4 : title:database manager vacancies in montreal

This is going to retrieve doc1, doc2 and doc3.

In those cases, an "analyzed" field is what you want.
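
To see why the "jobs" query finds those three documents, here is a minimal sketch of a token-to-document lookup built from the four titles above. It assumes the same simplified lowercase-and-split analysis as before; `analyze`, `index` and `search` are illustrative names, and scoring is left out entirely.

```python
import re
from collections import defaultdict

titles = {
    "doc1": "developer jobs in montreal",
    "doc2": "java coder jobs in vancouver",
    "doc3": "unix designer jobs in toronto",
    "doc4": "database manager vacancies in montreal",
}

def analyze(text):
    """Simplified analysis: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

# Map every token to the set of documents whose title contains it.
index = defaultdict(set)
for doc_id, title in titles.items():
    for token in analyze(title):
        index[token].add(doc_id)

def search(query):
    """Return the ids of documents containing any token of the analyzed query."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return sorted(hits)

print(search("jobs"))      # ['doc1', 'doc2', 'doc3']
print(search("Montreal"))  # ['doc1', 'doc4']
```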

If you know in advance what kind of data will be in that field and you are going to query for exactly the value you want, then "not_analyzed" is what you want.

Example:

Get all the logs from server123.

query:"server:server123".

doc1 :server:server123,log:randomstring,date:01-jan
doc2 :server:server986,log:randomstring,date:01-jan
doc3 :server:server777,log:randomstring,date:01-jan
doc4 :server:server666,log:randomstring,date:01-jan
doc5 :server:server123,log:randomstring,date:02-jan

This returns results only from doc1 and doc5.
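
As a rough illustration of that exact-match behaviour, the sketch below filters the five sample documents in plain Python and also builds the body of a term query, which is the kind of query that compares the stored value without analyzing it. The `exact_match` helper is a made-up name for this example, and no request is actually sent to a cluster.

```python
import json

docs = [
    {"id": "doc1", "server": "server123", "log": "randomstring", "date": "01-jan"},
    {"id": "doc2", "server": "server986", "log": "randomstring", "date": "01-jan"},
    {"id": "doc3", "server": "server777", "log": "randomstring", "date": "01-jan"},
    {"id": "doc4", "server": "server666", "log": "randomstring", "date": "01-jan"},
    {"id": "doc5", "server": "server123", "log": "randomstring", "date": "02-jan"},
]

def exact_match(documents, field, value):
    """Mimic a not_analyzed lookup: the stored value must equal the query exactly."""
    return [d["id"] for d in documents if d.get(field) == value]

# Request body for a term query on the same field (built here, not sent).
term_query = {"query": {"term": {"server": "server123"}}}

print(exact_match(docs, "server", "server123"))  # ['doc1', 'doc5']
print(exact_match(docs, "server", "Server123"))  # [] because the case differs
print(json.dumps(term_query))
```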

And well, I hope you get the point. As I said, keep it simple: it is about what you need.

analyzed -> more space on disk (a LOT more if the analyzed fields are big).
analyzed -> more time for indexing.
analyzed -> better for matching documents.

not_analyzed -> less space on disk.
not_analyzed -> less time for indexing.
not_analyzed -> exact match on the field, or wildcards.

Regards,

Daniel

Thanks for the reply! I have a few questions: Is not_analyzed really case sensitive? I read that Kibana always does the search in lower case, so would having a servername of Server01 mean that it could not be searched? And how about regexp searches? Finally, for a range of 30 servers, does it make any sense to even bother about what type it is? I assume the speed and size change would be minimal... — higuita, May 31 '16 at 18:24
(not_analyzed) It is case sensitive, but that doesn't mean the field is unsearchable: if you store "Server01" then you have to search for "Server01"; "server01" won't match the document. When you search a field that is analyzed, ES will tokenize your search keyword, but if the field that you're searching is not_analyzed, ES won't lowercase it. Finally, it doesn't depend on the number of documents but on the size of the field. For the "server" field, if you're not planning on using special characters at all in that field, you could use analyzed just fine ;) — Daniel Andres Acevedo, Jun 05 '16 at 14:14
Imagine that we need to do some aggregation and search on a 'title' property... For the aggregation we don't need to analyze, but for the search engine I think we do... In this case, how can we define our document mapping? — famas23, Oct 12 '18 at 00:53
@famas23 You could add a multi-field mapping, which lets you have both analyzed and not_analyzed versions of the same field. Let's say you create "title": { "type": "string", "fields": { "title_raw": { "type": "string", "index" : "not_analyzed" } } }. Then you can access them as follows: for aggregations you will use title.title_raw, and for searches you will use title. Please keep in mind that this syntax is from an old version of Elasticsearch. — Daniel Andres Acevedo, Aug 28 '19 at 20:47
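
For reference, the multi-field mapping from the last comment can be written out as Python dicts, next to what should be the equivalent on Elasticsearch 5.x and later, where the string type was split into text (analyzed) and keyword (not analyzed). Treat this as a sketch: the sub-field name title_raw comes from the comment, `subfield_path` is a hypothetical helper, and the dicts are only built and printed here, not sent to a cluster.

```python
import json

# Legacy (pre-5.x) syntax, as given in the comment above.
legacy_title = {
    "title": {
        "type": "string",
        "fields": {
            "title_raw": {"type": "string", "index": "not_analyzed"}
        },
    }
}

# Newer syntax: "text" is analyzed, "keyword" is stored as a single exact token.
modern_title = {
    "title": {
        "type": "text",
        "fields": {
            "title_raw": {"type": "keyword"}
        },
    }
}

def subfield_path(mapping, field, sub):
    """Return the dotted name used to address a sub-field, e.g. in aggregations."""
    if sub not in mapping[field].get("fields", {}):
        raise KeyError(sub)
    return f"{field}.{sub}"

print(json.dumps(modern_title, indent=2))
print(subfield_path(modern_title, "title", "title_raw"))  # title.title_raw
```

Searches would target title, while aggregations and exact filters would target title.title_raw.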
