ElasticSearch-Haystack: Spanish Tokenizer "Fails"

Question

I'm using:

Haystack - 2.1.0
ElasticSearch - 0.90.3
pyelasticsearch - 0.6

I've configured a custom backend to change the default Elasticsearch settings and use a Spanish analyzer.

I'm using these settings for Elasticsearch:

"settings" : {

        "index": {
            "uuid": "IPwcMthwRpSJzpjtarc9eQ",
            "analysis": {
               "analyzer": {
                  "default": {
                     "filter": ["standard", "lowercase", "asciifolding", ],
                     "tokenizer": "standard"
                  }
               }
            },
            "number_of_replicas": "1",
            "number_of_shards": "10",
        }
    },
    "analyzer": {
            "spanish": {
              "tokenizer":  "standard",
              "filter": [
                "lowercase",
                "spanish_stop",
                "spanish_keywords",
                "spanish_stemmer"
              ]
            }
    }

I found these settings in an answer here. When I apply them to Elasticsearch and reindex my models, I get behaviour that I'm not sure I understand.

I have some objects with names like "Ciencias" and others like "Ciéncies". When I search for "ciencias", I receive objects with names like "Ciencias" and "Ciéncies", and the same happens when I search for "ciencies" or "ciéncies".

I want Elasticsearch to ignore accents, which is why I'm using asciifolding, and I'm using the Spanish tokenizer because most of the text is in Spanish. I don't understand why different words like "cienciAs" and "cienciEs" return the same results.

Why is this happening? Is it because a default ngram analyzer is splitting the words?

Why, when searching for "cienciAs", do I get objects with names like "ciénciEs" as results?

score 1 · Answer 1 · answered Aug 20 '14 at 09:05

Probably because the stemmer is doing its job. If you want to find out what happens while tokenising or stemming, install the inquisitor plugin and go to the Analyzers tab (see here)
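
To illustrate the hypothesis above, here is a minimal Python sketch of how a suffix-stripping stemmer can map two different words to one index term. The `toy_stem` function and its `SUFFIXES` list are hypothetical and written only for this example; this is not the Snowball Spanish algorithm, whose rules are considerably more involved.

```python
# Toy illustration only: this is NOT the Snowball Spanish stemmer.
# SUFFIXES is a hypothetical list of plural-like endings, longest first.
SUFFIXES = ("as", "es", "os", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


for word in ("ciencias", "ciencies"):
    # Both words lose their ending and end up as the same stem.
    print(word, "->", toy_stem(word))
```

If the real stemmer behaves similarly for these two words, both would be indexed under the same term, and a query for either one would match documents containing the other.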

mjl

Thanks for your answer. I've never used the Inquisitor plugin and it seems really helpful! In the Analyzers tab, the only analyzer that splits the words "ciencias" or "ciéncies" is the `Snowball` analyzer, but it shouldn't be using that analyzer – AlvaroAV Aug 20 '14 at 09:15
A bit unrelated to your original question, but check out the icu-analysis plugin; it can do collation correctly in all kinds of languages, so you are not limited to using asciifolding – mjl Aug 20 '14 at 09:28
Sorry for my English; as you can guess, it's not my first language. I disabled the stemmer and I'm still getting the same results. I'm looking for better configurations for the analyzer. – AlvaroAV Aug 21 '14 at 06:31

score 0 · Accepted Answer · answered Aug 25 '14 at 07:48

Finally I removed the Spanish analyzer and everything began to work as expected.

Now I'm using only the asciifolding and lowercase filters. Accents and ñ's are indexed correctly, and I no longer have the issue with "ciencias" and "ciencies".
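
For reference, a sketch of what such a setup might look like. The exact settings dict is not shown in this answer, so `INDEX_SETTINGS` below is an assumption based on the description (standard tokenizer, lowercase and asciifolding filters). `approximate_analyze` is a hypothetical helper that only approximates that filter chain in plain Python; it is not what Elasticsearch runs internally.

```python
import unicodedata

# Assumed shape of the index settings; the exact dict used is not shown in the post.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}


def approximate_analyze(text):
    """Rough stand-in for standard tokenizer + lowercase + asciifolding."""
    tokens = []
    # Whitespace split: far simpler than the real standard tokenizer.
    for word in text.split():
        lowered = word.lower()
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", lowered)
        folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        tokens.append(folded)
    return tokens


for name in ("Ciencias", "Ciéncies", "ciencies"):
    print(name, "->", approximate_analyze(name))
```

With no stemmer in the chain, "Ciéncies" and "ciencies" fold to the same token, while "ciencias" stays a distinct term, which matches the behaviour described above.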

AlvaroAV
