
I am using a PHP library for Elasticsearch to index and search documents on my website. This is the command I use to create the index:

curl -XPUT 'http://localhost:9200/test/' -d '
{
  "index": {
    "numberOfShards": 1,
    "numberOfReplicas": 1
  }
}'

I then use curl -XPUT to add documents to the index and curl -XGET to query it. This works well, except that singular and plural forms of query words are not matched against each other. For example, when I search for "discussions", documents containing "discussion" are not returned, and vice versa. Why is this? I thought Elasticsearch handled this by default. Is there something I have to configure explicitly for it to match singular and plural forms?

Saeed Zhiany
Ninja

3 Answers


The default Elasticsearch analyzer doesn't do stemming, and stemming is what you need to match plural and singular forms. You can try the Snowball analyzer on your text fields to see whether it works better for your use case:

curl -XPUT 'http://localhost:9200/test' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    },
    "mappings" : {
        "page" : {
            "properties" : {
                "mytextfield": { "type": "string",  "analyzer": "snowball", "store": "yes"}
            }
        }
    }
}'
imotov
  • I tried this and I get an error: "Message: Failed to load class setting [type] with value [snowball]". Should I be installing something more here? If so what and where from? – Ninja Nov 10 '11 at 17:14
  • Which version of elasticsearch are you using? I tested it on 0.17 and on master and it works fine on both with default settings. Did you modify the command in any way? – imotov Nov 11 '11 at 00:13
  • I am using Elasticsearch 0.14. I didn't modify the command, so I'm not sure why I am getting the error. I used porter stem and it worked for me. I have added the config I used in the answer below. Thanks for your help! – Ninja Nov 14 '11 at 05:58
  • Imotov, I need your help with one more thing. In your example you defined stemming for specific fields (e.g. mytextfield in your code). How do I specify that stemming should happen on all fields? I would really appreciate your help. – Ninja Nov 14 '11 at 22:12
  • You can create an analyzer called "default" the same way you created "stem", and this analyzer will be applied to all fields by default. Or you can do it with dynamic templates (http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html). It's a more complex method, but it gives you more flexibility. I think this feature was already present in 0.14, but I am not 100% sure. – imotov Nov 15 '11 at 02:08
  • Hi imotov, I have one more question. If I want to enable highlighting of matches in the search results, do I need to mention anything in the analyzer or only in the query? Will be helpful if you could give an example of how I can enable highlighting for the matches across any field – Ninja Nov 25 '11 at 22:04
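
The "default" analyzer approach from imotov's comment might look roughly like the sketch below. It assumes the same older-style API used elsewhere in this thread; the index name index_name and the filter chain are borrowed from Ninja's answer for illustration and were not confirmed by imotov. Newer Elasticsearch releases removed the "standard" token filter and expect a Content-Type header, so the command would need adjusting there.

```shell
# Assumption: Elasticsearch is listening on localhost:9200 and
# "index_name" does not exist yet (both are placeholders).
# Naming the analyzer "default" makes it apply to every field
# that does not declare its own analyzer.
curl -XPUT 'http://localhost:9200/index_name' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "stop", "porter_stem"]
        }
      }
    }
  }
}'
```

With this in place, no per-field "analyzer" entry should be needed in the mapping unless a field has to opt out of stemming.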

For some reason, snowball is not working for me; I get the error I mentioned in my comment on @imotov's answer. I used the porter_stem filter instead and it worked perfectly. This is the config I used:

curl -XPUT localhost:9200/index_name -d '
{
"settings" : {
    "analysis" : {
        "analyzer" : {
            "stem" : {
                "tokenizer" : "standard",
                "filter" : ["standard", "lowercase", "stop", "porter_stem"]
            }
        }
    }
},
"mappings" : {
    "index_type_1" : {
        "dynamic" : true,
        "properties" : {
            "field1" : {
                "type" : "string",
                "analyzer" : "stem"
            },
            "field2" : {
                "type" : "string",
                "analyzer" : "stem"
            }
         }
      }
   }
}'
Ninja
  • thank you so much @ninja for actually putting your mapping in. porter_stem is a lifesaver – caro Dec 02 '21 at 18:53

The 'porterStem' filter is aggressive, so the 'minimal_english' filter may be a better fit here. 'porterStem' reduces many related words to the same token, for example:

Searching for 'Test' will return 'Test', 'Tests', 'Testing', 'Tester', etc.

With 'minimal_english', the same search will only return 'Test' and 'Tests'.
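
As a rough sketch of how this could be wired up, the settings below define a custom token filter of type "stemmer" with the language set to minimal_english and use it in place of porter_stem. This assumes a release that ships the stemmer filter with that language; the index name index_name, the filter name english_minimal_stem, and the analyzer name stem are illustrative placeholders, not taken from this answer.

```shell
# Assumption: Elasticsearch is listening on localhost:9200 and
# "index_name" does not exist yet (both are placeholders).
# "english_minimal_stem" is a hypothetical name for the custom filter.
curl -XPUT 'http://localhost:9200/index_name' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "english_minimal_stem": {
          "type": "stemmer",
          "language": "minimal_english"
        }
      },
      "analyzer": {
        "stem": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "english_minimal_stem"]
        }
      }
    }
  }
}'
```

Fields can then reference the analyzer by name ("analyzer": "stem") in the mapping, in the same way as the porter_stem configuration earlier in the thread.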

Himadri Pant
  • Your answer wasn't marked correct since it came much later than the first, but this is obviously a much better solution. snowball analyzer is horribly inaccurate. porterStem is a bit better, and might be usable. kstem is even less sensitive, and minimal_english is the least sensitive. But snowball is horrible. – Henley Jan 12 '14 at 20:15
  • @Sekai in your java code minimal_english can be imported from org.apache.lucene.analysis.en.EnglishMinimalStemFilter and for using in a query it'll be "filter: minimal_english" – Himadri Pant Feb 14 '14 at 01:58