My task is:

* Make procter&gamble and procter & gamble produce the same results, including the score
* Make it universal, not via synonyms, as it can be any other Somehow&Somewhat
* Highlight procter&gamble or procter & gamble as a whole, not as separate tokens, if the phrase matches
* Keep using simple_query_string, as I allow search operators
* Make AT&T searchable as well

Here is my snippet. The problem is that the procter&gamble and procter & gamble searches produce different scores, and therefore different documents in the results. But the user expects the same results for procter&gamble and procter & gamble.

DELETE /english_example
PUT /english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        },
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all" : true,
          "preserve_original":true
        },
        "acronymns_": {
          "type": "word_delimiter_graph",
          "catenate_all" : true,
          "preserve_original":true
        },
        "custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "t" ]
        }

      },
      "analyzer": {
        "default": {
          "tokenizer":  "whitespace",
          "char_filter": [
           "ampersand_filter"
          ],
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "acronymns",
            "flatten_graph",
            "english_stop",
            "custom_stop_words_filter",
            "english_keywords",
            "english_stemmer"
          ]
        }
      },
      "char_filter": {
        "ampersand_filter": {
          "type": "pattern_replace",
          "pattern": "(?=[^&]*)( {0,}& {0,})(?=[^&]*)",
          "replacement": "_and_"
        },
        "ampersand_filter2": {
          "type": "mapping",
          "mappings": [
            "& => _and_"
          ]
        }
      }
    }
  }
}
PUT /english_example/_bulk 
{ "index" : { "_id" : "1" } }
{ "description" : "wi-fi AT&T BB&T Procter & Gamble, some\nOther $500 games with Peter's", "contents" : "Much text with somewhere I meet Procter or Gamble" }
{ "index" : { "_id" : "2" } }
{ "description" : "Procter & Gamble", "contents" : "Much text with somewhere I meet Procter and Gamble" }
{ "index" : { "_id" : "3" } }
{ "description" : "Procter&Gamble", "contents" : "Much text with somewhere I meet Procter & Gamble" }
{ "index" : { "_id" : "4" } }
{ "description" : "Come Procter&Gamble", "contents" : "Much text with somewhere I meet Procter&Gamble" }
{ "index" : { "_id" : "5" } }
{ "description" : "Tome Procter & Gamble", "contents" : "Much text with somewhere I don't meet AT&T" }


# "query": "procter & gamble",
GET english_example/_search
{
    "query": {
      "simple_query_string": {
          "query": "procter & gamble",
          "default_operator": "or",
          "fields": [
            "description^2",
            "contents^80"
          ]
      }
    },
    "highlight": {
      "fields": {
        "description": {},
        "contents": {}
      }
    }
}


# "query": "procter&gamble",
GET english_example/_search
{
    "query": {
      "simple_query_string": {
          "query": "procter&gamble",
          "default_operator": "or",
          "fields": [
            "description^2",
            "contents^80"
          ]
      }
    },
    "highlight": {
      "fields": {
        "description": {},
        "contents": {}
      }
    }
}


# "query": "at&t",
GET english_example/_search
{
    "query": {
      "simple_query_string": {
          "query": "at&t",
          "default_operator": "or",
          "fields": [
            "description^2",
            "contents^80"
          ]
      }
    },
    "highlight": {
      "fields": {
        "description": {},
        "contents": {}
      }
    }
}

In my snippet I redefine the default analyzer, using word_delimiter_graph and the whitespace tokenizer, so that AT&T can be matched as well.
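
To make the behaviour of the snippet easier to reason about, here is a small Python sketch (not Elasticsearch itself) that replays the ampersand_filter pattern on the sample text. The function names are made up for illustration, and Python's re module is only a stand-in for the Java regex engine that pattern_replace uses. The last function rests on an assumption that is not confirmed anywhere in this thread: that simple_query_string splits the query text on whitespace before the analyzer sees it.

```python
import re

# Same pattern and replacement as the "ampersand_filter" char filter above.
# Assumption: for this pattern Python's `re` behaves like the Java engine.
AMPERSAND = re.compile(r"(?=[^&]*)( {0,}& {0,})(?=[^&]*)")


def apply_ampersand_filter(text):
    """Replace '&', together with any spaces around it, by '_and_'."""
    return AMPERSAND.sub("_and_", text)


def index_side_chunks(text):
    """Char filter first, then a whitespace split, as in the default analyzer.

    The token filters (word_delimiter_graph, stop words, stemmer) are NOT
    modelled here; this only shows what the tokenizer receives.
    """
    return apply_ampersand_filter(text).split()


def query_side_chunks(query):
    """Hypothetical model of the query side: split on whitespace FIRST,
    then run the char filter on each chunk separately."""
    return [apply_ampersand_filter(chunk) for chunk in query.split()]


if __name__ == "__main__":
    for sample in ("Procter & Gamble", "Procter&Gamble", "AT&T"):
        print("index:", sample, "->", index_side_chunks(sample))
    for sample in ("procter & gamble", "procter&gamble"):
        print("query:", sample, "->", query_side_chunks(sample))
```

If the assumption holds, both spellings are indexed as a single Procter_and_Gamble chunk, but the query procter & gamble reaches the analyzer as three separate chunks. That would be one possible explanation for the different scores. The _analyze API on the real index is the way to check what is actually produced.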

AHeavyObject

2 Answers


I just realized that you are searching a description field, not a company field, so the keyword analyzer won't work. I have updated my answer accordingly.

You could try adding a custom subfield analyzed with the whitespace tokenizer and a lowercase filter, and use the same custom analyzer at search time. When you search, query both the standard field and this custom field with a multi_match query. That should allow you to support both spellings. You can boost the custom field so that exact matches come to the top of the search results.

The trick is to convert the user input to lowercase before performing the search. You shouldn't pass the user input through as is, or this approach won't work.

You can use the scripts below to try it out.

DELETE /test1
PUT /test1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer" : {
          "filter" : ["lowercase"],
          "type" : "custom",
          "tokenizer" : "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description" : {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "custom" : {
            "type" : "text",
            "analyzer" : "lowercase_analyzer",
            "search_analyzer" : "lowercase_analyzer"
          }
        }
      }
    }
  }
}

PUT /test1/_bulk
{ "index" : { "_id" : "1" } }
{ "description" : "wi-fi AT&T BB&T Procter & Gamble, some\nOther $500 games with Peter's" }
{ "index" : { "_id" : "2" } }
{ "description" : "Procter & Gamble" }
{ "index" : { "_id" : "3" } }
{ "description" : "Procter&Gamble" }

GET test1/_search
{
  "query": {
    "multi_match": {
      "query": "procter&gamble",
      "fields": ["description", "description.custom"]
    }
  },
  "highlight": {
    "fields": {
      "description": {},
      "description.custom": {}
    }
  }
}


GET test1/_search
{
  "query": {
    "multi_match": {
      "query": "procter",
      "fields": ["description", "description.custom"]
    }
  },
  "highlight": {
    "fields": {
      "description": {},
      "description.custom": {}
    }
  }
}

GET test1/_search
{
  "query": {
    "multi_match": {
      "query": "at&t",
      "fields": ["description", "description.custom"]
    }
  },
  "highlight": {
    "fields": {
      "description": {},
      "description.custom": {}
    }
  }
}

GET test1/_search
{
  "query": {
    "multi_match": {
      "query": "procter & gamble",
      "fields": ["description", "description.custom"]
    }
  },
  "highlight": {
    "fields": {
      "description": {},
      "description.custom": {}
    }
  }
}

You can add highlighting and try it out.
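
The lowercasing step mentioned above has to happen on the client before the request is sent. A minimal Python sketch of such a step follows. The function name normalize_query is hypothetical, and collapsing the spaces around & goes beyond what this answer describes: it is an extra assumption, added so that both spellings reach Elasticsearch as the same string.

```python
import re

# Optional whitespace on either side of an ampersand.
_AMP_SPACING = re.compile(r"\s*&\s*")


def normalize_query(user_input):
    """Lowercase the raw user input and write every 'x & y' as 'x&y'.

    Lowercasing is the step described in the answer; the ampersand
    collapsing is an assumption made for this sketch only.
    """
    lowered = user_input.lower()
    return _AMP_SPACING.sub("&", lowered).strip()


if __name__ == "__main__":
    for raw in ("Procter & Gamble", "procter&gamble", "AT&T"):
        print(repr(raw), "->", repr(normalize_query(raw)))
```

With this step in front of the search, procter & gamble and procter&gamble produce the same query text, so they cannot be scored differently. The cost is that a user can no longer search for a standalone & between two unrelated words.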

askids
  • Did you try it before writing here? Though I was sure it wouldn't work, I tried it as you described: https://paste.opensuse.org/view/simple/29492497 And sure enough, it didn't work. It matched exactly the same 3rd result only. – AHeavyObject May 11 '20 at 00:55
  • We have been using a similar approach for a few years now. There are some gotchas to take care of, or else it won't work. So probably the description alone was not sufficient for you to try it on your own. Let me modify my answer. – askids May 11 '20 at 05:02
  • Also, to add, we don't use highlighting. So it works well for us in terms of finding the match. I realize the highlighting may be an issue for your use case, even though the search will work. – askids May 11 '20 at 05:33
  • Alas, askids, it doesn't work as expected. Here I get results with good highlights and the expected documents matching: https://paste.opensuse.org/view/simple/34835555 But the score values are different for `procter&gamble` and `procter & gamble`. So a user gets a different ordering and different documents for similar requests. – AHeavyObject May 11 '20 at 08:12
  • From your scripts, if I get rid of the custom stop word filter, I am getting results. Please check. Also, I verified that the scores are the same for both matching docs, so that should be fine. If you need to boost exact matches, try the multi_match route (from my sample), where you can boost the custom field. – askids May 11 '20 at 08:32
  • I tried removing the custom stop filter and it doesn't affect the scores. Please provide the code you mean. Here is my code, where I get a different order for 'procter&gamble' and 'procter & gamble'. Can this be fixed or at least explained? https://paste.opensuse.org/view/simple/65217922 – AHeavyObject May 11 '20 at 08:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/213601/discussion-between-gruz-and-askids). – AHeavyObject May 11 '20 at 08:43

One option I can think of is a bool query with two should clauses, one using the standard analyzer and one using your custom analyzer.

For "procter & gamble", the tokens generated by both the custom and the standard analyzer will be "procter" and "gamble". For "procter&gamble", the custom analyzer will generate "procter", "gamble" and "procter&gamble", while the standard analyzer will generate "procter" and "gamble".

So in the should clauses we can use the standard analyzer to look for "procter" or "gamble", and the custom analyzer to look for "procter&gamble".

GET english_example/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "description": {
              "query": "Procter&Gamble",
              "analyzer": "standard"
            }
          }
        },
        {
          "match": {
            "description": {
              "query": "Procter&Gamble"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
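
If the request is built in application code rather than typed into the console, the same body can be generated for any user input. The sketch below is only an illustration: build_should_query is a hypothetical helper, it produces the request body as a plain dict, and it does not talk to Elasticsearch.

```python
import json


def build_should_query(user_query, field="description"):
    """Return the bool/should body shown above for an arbitrary query string.

    The first clause forces the standard analyzer; the second has no
    "analyzer" key, so the field's own search analyzer is used.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": user_query, "analyzer": "standard"}}},
                    {"match": {field: {"query": user_query}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the body of the _search request.
    print(json.dumps(build_should_query("Procter&Gamble"), indent=2))
```
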

A second option is to use synonyms, where you define all the variations in which Procter and Gamble can appear as meaning a single thing.

jaspreet chahal
  • This doesn't fit. For `procter&gamble` it highlights only `procter`. Besides, it gives totally different score values and thus a different list of documents. A user searching for `procter&gamble` and `procter & gamble` expects to get the same results, with the correct match highlighted. Synonyms surely cannot be an option. A general solution is needed. There is an unpredictable number of Something&Something variants. – AHeavyObject May 11 '20 at 02:30