How to prevent Facet Terms from tokenizing

Question

I am using Facet Terms to get all the unique values of a field and their counts, but I am getting wrong results:

term: web 
Count: 1191979 
term: misc 
Count: 1191979 
term: passwd 
Count: 1191979 
term: etc 
Count: 1191979

The result I expect is:

term: WEB-MISC /etc/passwd 
Count: 1191979

Here is my sample query:

{
  "facets": {
    "terms1": {
      "terms": {
        "field": "message"
      }
    }
  }
}

Could you update the question with a _short_ example of the data and a _short_ example of the query you're doing, so it's more informative for users coming here from Google searches etc? — karmi, Apr 11 '12 at 09:14

imotov · Accepted Answer · 2012-04-11T10:33:45.860

15

If reindexing is an option, it would be best to change the mapping and mark this field as not_analyzed:

"your_field" : { "type": "string", "index" : "not_analyzed" }

You can use the multi_field type if you also want to keep an analyzed version of the field:

"your_field" : {
  "type" : "multi_field",
  "fields" : {
    "your_field" : {"type" : "string", "index" : "analyzed"},
    "untouched" : {"type" : "string", "index" : "not_analyzed"}
  }
}

This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
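As a sketch, the facet request from the question would then point at the untouched sub-field. Here your_field stands in for the real field name, and terms1 is the facet name used in the question:

```json
{
  "facets": {
    "terms1": {
      "terms": {
        "field": "your_field.untouched"
      }
    }
  }
}
```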

Alternatively, if this field is stored, you can use a script field facet instead:

"facets" : {
  "term" : {
    "terms" : {
      "script_field" : "_fields.your_field.value"
    }
  }
}

As a last resort, if this field is not stored but the record source is stored in the index, you can try this:

"facets" : {
  "term" : {
    "terms" : {
      "script_field" : "_source.your_field"
    }
  }
}

The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.

edited Apr 11 '12 at 10:33

answered Apr 10 '12 at 17:57


I tried the script_field but it seemed to produce an error. My current query looks like this though: http://www.pastebin.com/XwJMM7Eq – jmnwong Apr 10 '12 at 19:48
It probably gives you an "unresolved property of identifier: logsource" error. That's because the elasticsearch script doesn't know what 'logsource' means. Try replacing it with _fields.logsource – imotov Apr 10 '12 at 19:57
Shows up as "term" "org.elasticsearch.search.lookup.FieldLookup@1209016" – jmnwong Apr 10 '12 at 20:13
Sorry, I meant _fields.logsource.value. See http://www.elasticsearch.org/guide/reference/modules/scripting.html – imotov Apr 10 '12 at 20:34
Turns out that I had to re-index it with the "index":"not_analyzed" field. That did the trick. Thanks! – jmnwong Apr 10 '12 at 22:34

Ivan, great info, maybe adding short info about "multifielding" the field would be a nice overview for the solution of the "my facets are broken how do I fix that" problem? – karmi Apr 11 '12 at 09:13
Great idea, Klement ;) I have added a multifield example to my answer. – imotov Apr 11 '12 at 10:36

Mohan Kumar · Answer 2 · 2016-01-07T07:06:11.187

I ran into the same issue today while running a terms aggregation on a recent version of Elasticsearch. After some googling, I worked out how the indexing works (it is quite simple).

Queries can find only terms that actually exist in the inverted index

When you index the following string

"WEB-MISC /etc/passwd"

it will be passed to an analyzer. The analyzer might tokenize it into

"WEB", "MISC", "etc" and "passwd"

along with their position details. These tokens might then be passed through a lowercase filter, giving

"web", "misc", "etc" and "passwd"

So, after indexing, a search query can see only the four tokens above, not the complete string "WEB-MISC /etc/passwd". For your requirement, these are the options you can use:

1. Change the default analyzer used by Elasticsearch ([link][1])
2. If analysis is not needed, turn the analyzer off by setting 'not_analyzed' on the fields in question
3. To make already indexed data searchable this way, re-indexing is the only option
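For the first option, a minimal sketch of index settings that replace the default analyzer with the built-in keyword analyzer, which emits the whole value as a single token. This is an illustration under that assumption rather than the configuration from the linked page, and it affects every analyzed string field in the index that does not declare its own analyzer, so full-text matching on those fields is lost:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "keyword"
        }
      }
    }
  }
}
```

As noted in option 3, these settings only apply to documents indexed after the change.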

Vineeth Mohan · Answer 3 · 2015-11-01T16:42:08.713

-1

I have briefly explained this problem and proposed two solutions here. One is to use not_analyzed to preserve the string as it is. But since that has the drawback of being case sensitive, a better approach would be to use the keyword tokenizer with a lowercase filter.
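A minimal sketch of that second approach as a create-index body, assuming the same string-type mapping syntax used in the accepted answer. The analyzer name lowercase_keyword and the type name your_type are placeholders chosen for illustration, not names from the original post:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "your_field": {
          "type": "string",
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}
```

With an analyzer like this, the whole value is indexed as one lowercased term, so a terms facet on the field would return web-misc /etc/passwd as a single entry.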

edited Nov 01 '15 at 16:42

answered Oct 09 '15 at 10:35


While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Leigh Oct 10 '15 at 03:10
I have briefed my answer. – Vineeth Mohan Nov 01 '15 at 16:42
