This may not be the best approach, but here are my two cents!
I don't think Elasticsearch provides an exact out-of-the-box solution for this use case. The closest way to do what you want is to make use of the More Like This (MLT) query.
This query essentially helps you find documents that are similar to a document you provide as input.
Basically, the algorithm is:
- Find the top K terms with the highest tf-idf from the input document.
- You can specify a minimum term frequency for the input words (e.g. 1 or 2) via `min_term_freq`; looking at your use case it would be `1`, meaning only words whose term frequency in the input document is at least 1 are considered.
- Construct a disjunctive query (logical OR) from up to N of these terms. N is configurable in the query request via the `max_query_terms` property, and by default it is `25`.
- Execute the query internally and return the most similar documents.
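The steps above can be sketched roughly in plain Python. This is only an illustrative approximation of the term-selection part (the helper name and the simplified tf-idf formula are my own, not Elasticsearch's exact scoring):

```python
from collections import Counter
import math

def top_k_terms(input_doc, corpus, k=25, min_term_freq=1, min_doc_freq=1):
    # Score each term of the input document by tf-idf against the corpus
    # and keep the top k -- a rough sketch of MLT's term selection.
    n_docs = len(corpus)
    tf = Counter(input_doc)          # term frequency within the input document
    df = Counter()                   # document frequency across the corpus
    for doc in corpus:
        df.update(set(doc))
    scored = []
    for term, freq in tf.items():
        if freq < min_term_freq or df[term] < min_doc_freq:
            continue                 # mirrors min_term_freq / min_doc_freq filtering
        idf = math.log(n_docs / (1 + df[term])) + 1
        scored.append((freq * idf, term))
    return [term for _, term in sorted(scored, reverse=True)[:k]]

# The selected terms are then OR-ed into one disjunctive query:
terms = top_k_terms(
    ["11", "12", "13", "14", "15"],        # input page
    [["11", "12", "13", "14", "105"],      # corpus page 1
     ["21", "22", "23", "24", "205"]],     # corpus page 2
)
disjunctive_query = {
    "bool": {"should": [{"term": {"pages.words": t}} for t in terms]}
}
```

Note how `"15"` is dropped: it appears in no corpus document, so it fails the `min_doc_freq` filter.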
More accurately, from this link:
> The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.
Let's see how we can achieve some use-cases that you've mentioned.
Use Case 1: Find documents of a page having min_word_match_score 2.
Note that your field `pages` would need to be of the `nested` type; with the `object` type this scenario wouldn't be possible. I suggest you go through the aforementioned links to learn more about this.
Let's say I have two indexes
- my_book_index - This would have the documents to be searched on
- my_book_index_input - This would have the documents used as input documents
Both would have the mapping structure as below:

```
{
  "mappings": {
    "properties": {
      "book_id": {
        "type": "keyword"
      },
      "pages": {
        "type": "nested"
      }
    }
  }
}
```
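For completeness, here is how that mapping could be applied when creating the indices (Kibana Dev Tools syntax; repeat the same body for `my_book_index_input`):

```
PUT my_book_index
{
  "mappings": {
    "properties": {
      "book_id": { "type": "keyword" },
      "pages":   { "type": "nested" }
    }
  }
}
```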
Sample Documents for my_book_index:
```
POST my_book_index/_doc/1
{
  "book_id": "book01",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "14", "105"] },
    { "page_id": 2, "words": ["21", "22", "23", "24", "205"] },
    { "page_id": 3, "words": ["31", "32", "33", "34", "305"] },
    { "page_id": 4, "words": ["41", "42", "43", "44", "405"] }
  ]
}

POST my_book_index/_doc/2
{
  "book_id": "book02",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "104", "105"] },
    { "page_id": 2, "words": ["21", "22", "23", "204", "205"] },
    { "page_id": 3, "words": ["301", "302", "303", "304", "305"] },
    { "page_id": 4, "words": ["401", "402", "403", "404", "405"] }
  ]
}

POST my_book_index/_doc/3
{
  "book_id": "book03",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "100", "105"] },
    { "page_id": 2, "words": ["21", "22", "23", "200", "205"] },
    { "page_id": 3, "words": ["301", "302", "303", "300", "305"] },
    { "page_id": 4, "words": ["401", "402", "403", "400", "405"] }
  ]
}
```
Sample Document for my_book_index_input:
```
POST my_book_index_input/_doc/1
{
  "book_id": "book_new",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "14", "15"] },
    { "page_id": 2, "words": ["21", "22", "23", "24", "25"] }
  ]
}
```
More Like This Query:
Use case: basically, I am interested in finding documents which would be similar to the above input document, having 4 matches in page 1 or 4 matches in page 2.
```
POST my_book_index/_search
{
  "size": 10,
  "_source": "book_id",
  "query": {
    "nested": {
      "path": "pages",
      "query": {
        "more_like_this": {
          "fields": ["pages.words"],
          "like": [
            {
              "_index": "my_book_index_input",
              "_id": 1
            }
          ],
          "min_term_freq": 1,
          "min_doc_freq": 1,
          "max_query_terms": 25,
          "minimum_should_match": 4
        }
      },
      "inner_hits": {
        "_source": ["pages.page_id", "pages.words"]
      }
    }
  }
}
```
Basically, I want to search in `my_book_index` for all the documents that are similar to `_doc: 1` in the index `my_book_index_input`.
Notice each and every parameter in the query; I'd suggest you go through it line by line to understand them all.
Note the response below when you execute that query:
Response:
```
{
  "took" : 71,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.096043,
    "hits" : [
      {
        "_index" : "my_book_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.096043,
        "_source" : {
          "book_id" : "book01"            <---- Document 1 returned
        },
        "inner_hits" : {
          "pages" : {
            "hits" : {
              "total" : {
                "value" : 2,              <---- Number of pages hit for this document
                "relation" : "eq"
              },
              "max_score" : 6.096043,
              "hits" : [
                {
                  "_index" : "my_book_index",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "pages",
                    "offset" : 0
                  },
                  "_score" : 6.096043,
                  "_source" : {
                    "page_id" : 1,        <---- Page 1 returned as it has 4 matches
                    "words" : [ "11", "12", "13", "14", "105" ]
                  }
                },
                {
                  "_index" : "my_book_index",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "pages",
                    "offset" : 1
                  },
                  "_score" : 6.096043,
                  "_source" : {
                    "page_id" : 2,        <---- Page 2 returned as it also has 4 matches
                    "words" : [ "21", "22", "23", "24", "205" ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
```
Note that only the document with book_id: book01 was returned. The reason is simple. I've mentioned the below properties in the query:

```
"min_term_freq": 1,
"min_doc_freq": 1,
"max_query_terms": 25,
"minimum_should_match": 4
```

Basically, only terms from the input document whose term frequency is at least 1 and which appear in at least 1 document are considered for the search, and the number of matching terms in a single nested document must be at least 4.
Change the parameters, e.g. set `min_doc_freq` to `3` and `minimum_should_match` to `3`, and you should see a few more documents.
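For example, the relaxed query would keep everything else the same and only change those two values (values here are purely for experimentation):

```
POST my_book_index/_search
{
  "query": {
    "nested": {
      "path": "pages",
      "query": {
        "more_like_this": {
          "fields": ["pages.words"],
          "like": [ { "_index": "my_book_index_input", "_id": 1 } ],
          "min_term_freq": 1,
          "min_doc_freq": 3,
          "max_query_terms": 25,
          "minimum_should_match": 3
        }
      }
    }
  }
}
```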
Notice that you may not see every document fulfilling the above properties; that is because of the way MLT has been implemented. Remember the steps I've mentioned at the beginning: only the top K terms by tf-idf from the input document are used to build the query, so some expected matches can be missed.
Use Case 2: Use Case 1 + return only those with a minimum page match of 2
I'm not sure if this is supported out of the box, i.e. adding a filter on inner_hits based on the count of inner_hits; however, I believe this is something you can add at your application layer. Basically, take the above response, compute inner_hits.pages.hits.total.value, and return only the qualifying documents to the consumer. Below is how your request/response flow would be:
For Request: Client Layer (UI) ---> Service Layer --> Elasticsearch
For Response: Elasticsearch ---> Service Layer (filter logic for n pages match) --> Client Layer (or UI)
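A sketch of that service-layer filter in plain Python, working on the parsed JSON response (the function name and the `min_page_match` parameter are my own; the response shape follows the example above):

```python
def filter_by_min_page_match(es_response, min_page_match=2):
    # Keep only hits whose number of matching nested pages
    # (inner_hits.pages.hits.total.value) is at least min_page_match.
    results = []
    for hit in es_response["hits"]["hits"]:
        page_matches = hit["inner_hits"]["pages"]["hits"]["total"]["value"]
        if page_matches >= min_page_match:
            results.append({"book_id": hit["_source"]["book_id"],
                            "page_matches": page_matches})
    return results

# Example with a trimmed-down response: book01 matched 2 pages, book02 only 1.
response = {"hits": {"hits": [
    {"_source": {"book_id": "book01"},
     "inner_hits": {"pages": {"hits": {"total": {"value": 2}}}}},
    {"_source": {"book_id": "book02"},
     "inner_hits": {"pages": {"hits": {"total": {"value": 1}}}}},
]}}
filtered = filter_by_min_page_match(response)  # only book01 survives
```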
This may not be the best solution, and at times it may give you results that are not exactly what you expect, but I'd suggest at least giving it a try, as the only alternative to this query is, sadly, to write your own custom client code making use of the Term Vectors API as mentioned in this link.
Remember the algorithm of how the MLT query works, and see if you can dig deeper into why the results return the way they do.
I hope this helps!