Elasticsearch only finding hits with ".keyword" appended

I'm having a terrible time querying an Elasticsearch 5 instance full of fluentd log entries that I imported from an older instance running version 1.7. Queries through Kibana for even the simplest things frequently time out, and I'm completely in the dark about where to look to investigate potential performance issues. A sample of the mappings for the index I'm querying looks like this:

=> {"@log_name"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
 "@timestamp"=>{"type"=>"date"},
 "@version"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
 "action"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
 "api"=>{"type"=>"boolean"},
 "controller"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
 "db"=>{"type"=>"float"},
 "duration"=>{"type"=>"float"},
 "error"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
 "filtered_params"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
 "user"=>
  {"properties"=>
    {"email"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
     "snowflake_id"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
     "snowflake_uid"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}},
     "type"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}},
  ...
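
If it helps, the full mapping can be pulled with:

curl -s -XGET 'localhost:9200/logstash-2017.08.15/_mapping?pretty'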

With that in place, I can query the index with curl, using something like the following to return the total number of documents found:

curl -s -XGET 'localhost:9200/logstash-2017.08.15/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "user.email": "user@example.com" 
          }
        }
      ]
    }
  }
}
' | jq ".hits.total"

0

Meaning that 0 documents were found. However, if I replace the user.email term with user.email.keyword, the same query returns a total of 40.
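
For the record, the variant that does find the 40 documents differs only in the field name:

curl -s -XGET 'localhost:9200/logstash-2017.08.15/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "user.email.keyword": "user@example.com"
          }
        }
      ]
    }
  }
}
' | jq ".hits.total"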

I guess my main question is: how do I know whether my mappings are correct for this data? (For the imported data, the mappings were created automatically at insert time, and I'm assuming they will continue to be created that way going forward.)

matt

1 Answer


The user.email field is of type text. When a value is indexed into a field of this type, an analyzer splits and transforms the source value into one or more terms, and each term is stored in the inverted index so it can be searched. The mapping does not specify an analyzer for the field, so the default standard analyzer is used. To see the terms the default analyzer produces, invoke:

curl -s -XGET http://localhost:9200/logstash-2017.08.15/_analyze -d'{"text": "user@example.com"}' | jq . 
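
Assuming the default standard analyzer, the output looks something like this (the address is split at the @ into two terms):

{
  "tokens": [
    {
      "token": "user",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "example.com",
      "start_offset": 5,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}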

Following your example, searching the user.email field for the term user will probably find results.

The user.email.keyword subfield is of type keyword. Fields of this type are searchable only by their exact value; that is, the value in the search query must exactly equal the original source value.
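
If you want to keep querying the analyzed user.email field with the full address, a match query is the usual fit, since it runs the query string through the same analyzer before searching. A sketch against the index from your example:

curl -s -XGET 'localhost:9200/logstash-2017.08.15/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "user.email": "user@example.com"
    }
  }
}
' | jq ".hits.total"

Note that with the default OR operator this can also match documents that merely share a term (for example, any other address at example.com); adding "operator": "and" to the match query requires all terms to match.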

Chin Huang
  • Oh, wow - that's incredibly helpful - I didn't know that ES breaks apart text fields so they're only searchable by their token values. However, I'm still stuck on why querying across all of my indices is so slow: `curl -s -XGET 'localhost:9200/logstash-*` times out, while `curl -s -XGET 'localhost:9200/logstash-2017.09*` does not. I don't have concrete proof yet, but it seems that aggregating the docs from before and after the import is affecting performance. – matt Sep 06 '17 at 13:01