I wrote my own Elasticsearch analyzer, but ran into a problem when configuring it.

I installed the analyzer with bin/plugin --url file:///[path_to_thulac.jar] --install analysis-smartcn (it is based on smartcn, so it keeps the smartcn name), and configured the mapping with:

curl -XPUT 'http://localhost:9200/about-index/_mapping/about' -d '
{
    "properties": {
        "searchable_text": {
            "type": "string",
            "analyzer": "smartcn"
        }
    }
}'

When I call curl -XGET 'localhost:9200/_analyze?analyzer=smartcn&pretty' -d '心理学概论', I get '心理学' and '概论', which is the result I want.

But when I call the search API:

curl 'http://localhost:9200/title-index/_search?pretty=true' -d '{
    "query" : {
        "query_string": {
            "default_field": "searchable_text",
            "query": "心理",
            "analyzer": "smartcn"
        }
    },
    "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "searchable_text"
            }
        }
    }
}'

I get terms: ["2014", "心理", "概论", "理学", "秋"]. I'm confused by this result. Can someone tell me why? Thank you.

finch

1 Answer

Your mapping wasn't set up properly. With a properly applied mapping, this record wouldn't even be returned by your query. If you apply the analyzer as shown in the example below:

curl -XDELETE "localhost:9200/test-idx?pretty"
curl -XPUT "localhost:9200/test-idx?pretty" -d '{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "searchable_text": { "type": "string", "analyzer": "smartcn" }
            }
        }
    }
}
'
curl -XPUT "localhost:9200/test-idx/doc/1?pretty" -d '{
    "searchable_text": "心理学概论2014秋"
}'
curl -XPOST "localhost:9200/test-idx/_refresh?pretty"

The following search request

curl "localhost:9200/test-idx/_search?pretty=true" -d '{
    "query" : {
        "query_string": {
            "default_field": "searchable_text",
            "query": "心理学"
        }
    },
    "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "searchable_text"
            }
        }
    }
}
'

will return:

"fields" : {
  "terms" : [ [ "2014", "心理学", "概论", "秋" ] ]
}

You get the same result from the analyzer as well:

curl -XGET 'localhost:9200/test-idx/_analyze?field=doc.searchable_text&pretty' -d '心理学概论2014秋'

{
  "tokens" : [ {
    "token" : "心理学",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "概论",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "2014",
    "start_offset" : 5,
    "end_offset" : 9,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "秋",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "word",
    "position" : 4
  } ]
}
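
To compare the analyzer output with the terms stored in the index, it can help to reduce the _analyze response to a plain token list. The sketch below does this with grep and sed. The helper name list_tokens is hypothetical, and the pattern matching assumes a response shaped like the one above rather than handling arbitrary JSON.

```shell
# Hypothetical helper: read an _analyze response on stdin and print one
# token per line, in the order the analyzer produced them.
# Assumes the response layout shown above; this is not a general JSON parser.
list_tokens() {
  # Keep only the "token" : "..." pairs ("tokens" and other keys do not match).
  { grep -Eo '"token"[[:space:]]*:[[:space:]]*"[^"]*"' || true; } |
    # Strip everything except the quoted value.
    sed -E 's/.*"([^"]*)"$/\1/'
}

# Usage against a live node (assumes Elasticsearch on localhost:9200):
#   curl -s -XGET 'localhost:9200/test-idx/_analyze?field=doc.searchable_text' \
#     -d '心理学概论2014秋' | list_tokens
```

If the terms reported by script_fields for the same document differ from this list (for example "心理" and "理学" instead of "心理学"), the field was probably indexed with a different analyzer than the one the _analyze call used.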

Execute the following command to make sure your mapping is properly applied:

curl -XGET 'http://localhost:9200/about-index/_mapping'
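
Reading the mapping response by eye is easy to get wrong, so here is a minimal sketch that prints only the analyzer names it contains. The helper name list_analyzers is hypothetical, and it assumes the response stores the setting as an "analyzer": "..." pair, as in the mapping bodies above. Note that the question applies the mapping to about-index but searches title-index, so it may be worth checking both.

```shell
# Hypothetical helper: read a _mapping response on stdin and print every
# analyzer name it declares, one per line. Prints nothing if no field
# declares an analyzer, which would mean the default analyzer is in use.
list_analyzers() {
  # Keep only the "analyzer": "..." pairs.
  { grep -Eo '"analyzer"[[:space:]]*:[[:space:]]*"[^"]*"' || true; } |
    # Strip everything except the quoted value.
    sed -E 's/.*"([^"]*)"$/\1/'
}

# Usage against a live node (assumes Elasticsearch on localhost:9200):
#   curl -s -XGET 'http://localhost:9200/about-index/_mapping' | list_analyzers
#   curl -s -XGET 'http://localhost:9200/title-index/_mapping' | list_analyzers
```

An empty result for the index you actually search would be consistent with the mapping not having been applied there.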
imotov
  • The document's searchable_text field is "心理学概论2014秋". I think the terms are the result of segmentation by the tokenizer. – finch Mar 20 '15 at 05:27
  • @dreamszl It looks like the mapping wasn't applied properly. I have updated my answer. – imotov Mar 20 '15 at 16:34