I'm building an in-site search engine with Elasticsearch, and I'm new to Elasticsearch. The sites that will use this engine are in Turkish and English.

Turkish has letters like 'ğ', 'ü', 'ş', 'ı', 'ö', 'ç'. But when we search, we usually type 'g', 'u', 's', 'i', 'o', 'c' instead. This is not a rule, just a habit we have gotten used to.

I have a document type called "product" with several string properties, some of which live on nested objects. For example:

public class Product {
    public string ProductName { get; set; }
    public Category Category { get; set; }
    //...
}
public class Category {
    public string CategoryName { get; set; }
    //...
}

My goal is this:

  • ProductName or Category.CategoryName may contain Turkish letters ("Eşarp") or some may be mistyped and written with English letters ("Esarp")
  • Querystring may contain Turkish letters ("eşarp") or not ("esarp")
  • Querystring may have multiple words
  • Every indexed string field should be searched against querystring (full-text search)

Now, what I did:

  • While creating the index, I also configured the mappings and defined a custom analyzer called "sanalyze", which uses the standard tokenizer with the "lowercase" and "asciifolding" filters, instead of the standard analyzer.
  • I used that custom analyzer in the mappings of the string fields.

Example code for mapping:

// ... more mappings, all of which use the same analyzer for their string fields.
.Map<Yaziylabir.Extensions.TagManagement.Models.TagModel>(m => m.AutoMap().Properties(p => p
    .String(s => s
        .Name(n => n.Tag).Analyzer("sanalyze")))))
.Settings(s => s
    .Analysis(ans => ans
        .Analyzers(anl => anl
            .Custom("sanalyze", c => c
                .Tokenizer("standard")
                .Filters("lowercase", "asciifolding")))));
  • I deleted and recreated the index, then reindexed my documents.
  • Now I'm trying to search in that index.
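
For reference, the NEST settings above correspond roughly to the following index-creation body in Elasticsearch 2.x. This is a sketch, not the exact output of NEST: the field name "productName" is an assumption (NEST's default camel-casing of ProductName), and only one string field is shown.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sanalyze": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "productName": { "type": "string", "analyzer": "sanalyze" }
      }
    }
  }
}
```

Note that the analyzer is attached to individual fields only; nothing here changes how other fields of the index are analyzed.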

I tried two different queries against the stored documents:

q &= Query<ProductModel>.QueryString(t => t.Query(Keyword).Analyzer("sanalyze"));

q &= Query<ProductModel>.QueryString(t => t.Query(Keyword));

The second one doesn't call the Analyzer method because the Elasticsearch documentation says that the analyzer defined on a field is also used when searching that field. So I thought there was no need to specify it again at search time.

What I got as result:

  • First query (with Analyzer("sanalyze")): searching for "eşarp" or "esarp" returns no results. Searching for "bordo" returns results.
  • Second query (without Analyzer("sanalyze")): searching for "eşarp" returns results. Searching for "esarp" returns no results. Searching for "bordo" returns results.

BTW:

  • Documents contain "Eşarp" as the ProductName value, and when I checked, Elasticsearch had created "esarp" as the term for that field.

  • Documents contain "Bordo" as a value and "bordo" as the field term.
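
One way to double-check which terms an analyzer produces is the _analyze API. A minimal sketch, assuming the body below is sent to the _analyze endpoint of an index where "sanalyze" is defined (the index name is whatever you created):

```json
{
  "analyzer": "sanalyze",
  "text": "Eşarp"
}
```

Going by the check described above, this should return the single token "esarp". Replacing "sanalyze" with "standard" shows what a field without folding holds instead: the lowercased token with the "ş" kept.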

I couldn't achieve what I want. What am I doing wrong?

  • Should I use another filter instead of asciifolding?
  • Should I use preserveOriginal with asciifolding? I'd rather not use that option, to avoid skewing the scores.
  • Is there something different I should do?

Can you please help me?

If you think it is not clear what I'm asking, please tell me, I will try to make it clearer.

Thank you.

zokkan
  • @RussCam, this is my new question :-) If you can help me, I'd be most grateful. – zokkan May 30 '16 at 11:52
  • It looks like you have an encoding issue. ASCII encoding removes non-printable characters, so you only get characters 0-127 and not 128-255, which is where the non-standard Arabic characters are located. I'm not sure whether your text may also contain Unicode characters. I have seen the same issue with the ToString() method, which also uses ASCII encoding. – jdweng May 30 '16 at 11:56
  • @jdweng but there is one thing that confuses me. When a property has a value like "Eşarp", I checked and verified that "esarp" is created as a term/token. So by my logic, the sanalyze analyzer works fine at index time. At search time, I guess I need something (I don't know what) that does to the query string what the sanalyze analyzer does to the string fields at index time, and then searches for that analyzed query string among the indexed terms. Am I wrong? If "eşarp" is indexed and stored as the term "esarp", and the query string "eşarp" is also searched as "esarp", there should be no problem, right? – zokkan May 30 '16 at 12:04
  • Any string search method should work. The filtering is causing the issues. I don't know Turkish, but I have seen many strange issues with different languages. Some languages have more than one upper/lower case form for a character. asciifolding will ignore some characters, so it will work in some instances and not in others. You may need to write your own upper/lower case method to resolve the issue. I wouldn't use asciifolding. – jdweng May 30 '16 at 12:28
  • @zokkan- this may help you - http://stackoverflow.com/a/37525868/1831 – Russ Cam May 30 '16 at 12:38
  • Hi @RussCam, I tried that too, but the problem still exists. I can search for "Ayşe" and get results, but I get no results for "Ayse". I don't understand this: it indexes "Ayşe" as both "ayse" and "ayşe", but no luck. Is it a bug or something? – zokkan May 30 '16 at 14:02

1 Answer

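With its default settings, query_string searches the _all field. The _all field has its own analyzer, which is the standard one unless you configure it otherwise, and the standard analyzer lowercases but does not apply asciifolding. So "Eşarp" ends up in _all as "eşarp", and a search for "esarp" finds nothing there, even though the ProductName field itself holds the term "esarp".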

You need to specify which field query_string should act on:

  "query": {
    "query_string": {
      "query": "your_field_name:esarp"
    }
  }

or

  "query": {
    "query_string": {
      "query": "esarp",
      "default_field": "your_field_name"
    }
  }
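
Since the goal is to search several string fields at once, query_string also accepts a "fields" list in place of a single default field. A minimal sketch; the field names are assumptions (NEST's default camel-casing of the properties in the question), and it assumes "category" is mapped as a plain object rather than as the `nested` type, which would need a nested query instead:

```json
{
  "query": {
    "query_string": {
      "query": "esarp",
      "fields": ["productName", "category.categoryName"]
    }
  }
}
```

Each listed field is searched with its own analyzer, so fields mapped with "sanalyze" should match both "eşarp" and "esarp".
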
Andrei Stefan
  • Hi @Andrei, can I specify an analyzer for the _all field in the mapping? I need to search lots of properties, and some of them are nested. Honestly, I don't know how to build a full-text search that covers both the string properties of the document and the nested ones. If you can, please give some examples. I will try tomorrow, it is 10pm here right now :-) – zokkan May 30 '16 at 19:13
  • @zokkan, yes you can specify an analyzer for the `_all` field using `.AllField(a => a.Analyzer("folding-analyzer"))` on the type mapping descriptor for your type. See https://github.com/elastic/elasticsearch-net/blob/2.x/src/Nest/Mapping/TypeMapping.cs#L177 – Russ Cam May 31 '16 at 06:15
  • Hi @Andrei I have tons of nested properties in my document type, should I also do anything about _all with these nested properties? – zokkan May 31 '16 at 11:05
  • nested fields will use the root document `_all` field only so if you want them included in `_all` and analyzed as well, you don't need to do anything. Using `_all` can be a blunt instrument for some use cases; you can create your own custom `_all` fields using `copy_to` - https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-all.html – Russ Cam May 31 '16 at 13:07
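
The two options from these comments can be sketched as an Elasticsearch 2.x mapping. This is an illustration under assumptions, not a tested configuration: the field names are guesses at NEST's default camel-casing, "searchText" is a hypothetical name for the custom catch-all field, and "category" is assumed to be a plain object. In practice you would pick one of the two options; they are shown together only for brevity.

```json
{
  "mappings": {
    "product": {
      "_all": { "analyzer": "sanalyze" },
      "properties": {
        "productName": {
          "type": "string",
          "analyzer": "sanalyze",
          "copy_to": "searchText"
        },
        "category": {
          "properties": {
            "categoryName": {
              "type": "string",
              "analyzer": "sanalyze",
              "copy_to": "searchText"
            }
          }
        },
        "searchText": { "type": "string", "analyzer": "sanalyze" }
      }
    }
  }
}
```

With the "_all" setting, a plain query_string with no field specified would analyze and match against folded terms. With the copy_to variant, the query would set "default_field" to "searchText". Either way the index has to be recreated and the documents reindexed for the new mapping to take effect.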