
Suppose that my index has two documents:

  1. "get my money"
  2. "my money get here"

When I do a regular match query for "get my money", both documents match correctly but they get equal scores. However, I want the order of words to be significant during scoring. In other words, I want "get my money" to have a higher score.

So I put my match query inside the must clause of a bool query and added a match_phrase query (with the same query string) in the should clause. This seems to score hits correctly until I search for "how do I get my money". In that case, the match_phrase query doesn't match at all, and the hits are returned with equal scores again.

How can I construct my index/query so that it takes word order into account but does not require all searched words to exist in the document?

Index mapping with test data

PUT test-index
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}
POST test-index/_doc/
{
    "keyword" : "get my money"
}
POST test-index/_doc/
{
    "keyword" : "my money get here"
}

Query "How do I get my money" - Doesn't work as needed

GET /test-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keyword": "how do i get my money"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "keyword": {
              "query": "how do i get my money"
            }
          }
        }
      ]
    }
  }
}

Results (both documents get the same score)

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 3.0,
    "hits" : [
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "6Xy8wXIB3NtI_ttPGBoV",
        "_score" : 3.0,
        "_source" : {
          "keyword" : "get my money"
        }
      },
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "6ny8wXIB3NtI_ttPGBpV",
        "_score" : 3.0,
        "_source" : {
          "keyword" : "my money get here"
        }
      }
    ]
  }
}

Edit 1

As @gibbs suggested, let's remove the "similarity": "boolean" setting. A simplified, more focused version of the issue is presented below. This is what we are trying to find an answer to.

Removed "similarity": "boolean"

PUT test-index
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text"
      }
    }
  }
}
POST test-index/_doc/
{
    "keyword": "get my money"
}
POST test-index/_doc/
{
    "keyword": "my money get here"
}

How can I make this query return results? Currently it returns none. Is it possible for match_phrase to return results when not all of the searched words exist in a document?

GET /test-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "keyword": {
              "query": "how do I get my money"
            }
          }
        }
      ]
    }
  }
}
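
To illustrate why the query above returns nothing, here is a minimal pure-Python sketch of phrase matching with the default slop of 0. It does not call Elasticsearch; the `tokenize` helper is a hypothetical, rough stand-in for the standard analyzer, and `phrase_matches` is an illustrative name. The point is only that a phrase query needs every query term to be present, adjacent, and in order.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption):
    # lowercase the text and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(query, document):
    # True only if every query token appears in the document,
    # consecutively and in the same order (phrase semantics with slop 0).
    q, d = tokenize(query), tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = ["get my money", "my money get here"]

# The short phrase is found in the first document only.
short_hits = [doc for doc in docs if phrase_matches("get my money", doc)]

# The long phrase contains "how", "do" and "i", which no document has,
# so it cannot match anything.
long_hits = [doc for doc in docs if phrase_matches("how do I get my money", doc)]

print(short_hits)
print(long_hits)
```

Under these assumptions, the longer query fails because of the extra words "how do I", not because of the word order of "get my money".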

Edit 2

In our use case, we can't use BM25 (TF/IDF) because that messes up our results.

POST test-index/_doc
{
  "keyword": "get my money, claim, distribution, getting started"
}

POST test-index/_doc 
{
  "keyword": "my money get here"
}
GET /test-index/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keyword": "how do I get my money"
          }
        }
      ]
    }
  }
}

Results

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6156533,
    "hits" : [
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "JnxCw3IB3NtI_ttPBjQv",
        "_score" : 0.6156533,
        "_source" : {
          "keyword" : "my money get here"
        }
      },
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "x3xSw3IB3NtI_ttP1DUi",
        "_score" : 0.49206492,
        "_source" : {
          "keyword" : "get my money, claim, distribution, getting started"
        }
      }
    ]
  }
}

In this scenario, "my money get here" scores higher than the intended "get my money, claim, distribution, getting started" document because of TF/IDF-style scoring. So we can't use a similarity whose score depends on the number of matching documents, the length of the field, and so on.
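
The effect of field length can be checked by hand. The sketch below assumes the classic BM25 formula including the (k1 + 1) factor, the default parameters k1 = 1.2 and b = 0.75, an index on a single shard that holds only the two documents above, and a tokenizer that yields 7 terms for the longer document and 4 for the shorter one. The function name `bm25_term_score` is illustrative, not an Elasticsearch API.

```python
import math


def bm25_term_score(doc_len, avg_len, docs_total, docs_with_term,
                    freq=1, k1=1.2, b=0.75):
    # Classic BM25 contribution of a single query term (assumed formula).
    idf = math.log(1 + (docs_total - docs_with_term + 0.5) / (docs_with_term + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return (k1 + 1) * idf * freq / (freq + length_norm)


# Assumed term counts per document:
# "get my money, claim, distribution, getting started" -> 7 terms
# "my money get here"                                   -> 4 terms
long_len, short_len = 7, 4
avg_len = (long_len + short_len) / 2

# "get", "my" and "money" each occur once in both documents;
# "how", "do" and "i" occur in neither and contribute nothing.
matching_terms = 3
short_score = matching_terms * bm25_term_score(short_len, avg_len, 2, 2)
long_score = matching_terms * bm25_term_score(long_len, avg_len, 2, 2)

print(round(short_score, 4), round(long_score, 4))
```

With these assumptions the two values come out at about 0.6157 and 0.4921, in line with the scores in the results above: the shorter field wins purely through length normalization, and word order plays no part.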

Sorry for the very long question. So, back to my original question: how can I construct my index/query so that it takes word order into account but does not require all searched words to exist in the document?
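
One possible direction, which nobody in this thread has confirmed, is to keep the match query in must and add one match_phrase should clause for every contiguous run of two or more query words, so that a document containing "get my money" in order can pick up extra score even when "how do I" is missing. The sketch below only builds the request body as a Python dict; `build_order_aware_query` and its `min_len` parameter are hypothetical names, and how much each phrase clause adds depends on the similarity in use, so the ranking should be verified against real data.

```python
import json


def build_order_aware_query(field, text, min_len=2):
    # Build a bool query: "must" keeps the plain match query,
    # "should" holds a match_phrase clause for each contiguous
    # sub-phrase of at least min_len words, longest first.
    words = text.split()
    should = []
    for size in range(len(words), min_len - 1, -1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            should.append({"match_phrase": {field: {"query": phrase}}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "should": should,
            }
        }
    }


body = build_order_aware_query("keyword", "how do i get my money")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of GET /test-index/_search. For a six-word query this produces 15 phrase clauses, so long queries may need a cap on the number of sub-phrases.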

Mehedi Hasan

1 Answer


The problem is caused by your similarity parameter.

A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost

Reference
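
As a rough illustration of the quoted behaviour, the sketch below scores a document by counting the distinct query terms it contains, each contributing its boost (assumed to be 1.0). It is a simplification that ignores repeated query terms; `tokenize` is a hypothetical stand-in for the standard analyzer and `boolean_similarity_score` is an illustrative name, not an Elasticsearch API.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption):
    # lowercase the text and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def boolean_similarity_score(query, document, boost=1.0):
    # Each distinct query term found in the document adds its boost;
    # term frequency, field length and word order are ignored.
    matched = set(tokenize(query)) & set(tokenize(document))
    return boost * len(matched)


query = "how do i get my money"
score_a = boolean_similarity_score(query, "get my money")
score_b = boolean_similarity_score(query, "my money get here")

print(score_a, score_b)
```

Both documents contain "get", "my" and "money", so under these assumptions both score 3.0, matching the identical `_score` values in the question.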

You should use a different similarity (such as BM25) to get better scores.

I removed the similarity parameter from your mapping and indexed the same data, so the default similarity was used.

The scores are as follows.

{
    "took": 1069,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.5809142,
        "hits": [
            {
                "_index": "test-index",
                "_type": "_doc",
                "_id": "WpaHwnIBa8oXh9OgX4Hb",
                "_score": 0.5809142,
                "_source": {
                    "keyword": "get my money"
                }
            },
            {
                "_index": "test-index",
                "_type": "_doc",
                "_id": "W5aHwnIBa8oXh9OgeYG9",
                "_score": 0.5167642,
                "_source": {
                    "keyword": "my money get here"
                }
            }
        ]
    }
}
Gibbs
  • Sorry for the confusion, please check my reply now. – Mehedi Hasan Jun 17 '20 at 17:25
  • Another data point on why I need to use `"similarity": "boolean"`. Without `"similarity": "boolean"`: ``` POST test-index/_doc { "keyword" : "get my money, claim, distribution, getting started" } POST test-index/_doc { "keyword" : "my money get here" } GET /test-index/_search { "query": { "bool": { "must": [ { "match": { "keyword": "how do i get my money" } } ] } } } ``` In this scenario **my money get here** scores higher than the intended **get my money**. – Mehedi Hasan Jun 17 '20 at 17:33
  • No, I am not able to reproduce this. I still see that `get my money` gets the higher score. I think it has something to do with your data as well. The score calculation also depends on the number of matching documents. – Gibbs Jun 17 '20 at 18:12
  • Updated my question, please take a look – Mehedi Hasan Jun 18 '20 at 06:56
  • Have a look at [this](https://stackoverflow.com/questions/27538766/scoring-by-term-position-in-elasticsearch). It will help you. – Gibbs Jun 18 '20 at 07:54