This may not be the best approach, but here are my two cents!
I don't think Elasticsearch provides an exact out-of-the-box solution for this use case. The closest way to do what you want is to make use of the More Like This (MLT) query.
This query essentially helps you find documents that are similar to a document you provide as input.
Basically, the algorithm is:
- Find the top K terms with the highest tf-idf from the input document.
- You can specify a minimum term frequency for the input words (e.g. 1 or 2) via `min_term_freq`; looking at your use case it would be `1`, meaning only words whose term frequency in the input document is at least 1 are considered.
- Construct a disjunctive query (logical OR) from up to N of these terms. N is configurable in the query request via the `max_query_terms` property, and by default it is `25`.
- Execute the query internally and return the most similar documents.
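The steps above can be sketched roughly in plain Python. This is only an illustrative approximation of the term-selection part (the helper name and the simplified tf-idf formula are my own, not Elasticsearch's exact scoring):

```python
from collections import Counter
import math

def top_k_terms(input_doc, corpus, k=25, min_term_freq=1, min_doc_freq=1):
    # Score each term of the input document by tf-idf against the corpus
    # and keep the top k -- a rough sketch of MLT's term selection.
    n_docs = len(corpus)
    tf = Counter(input_doc)          # term frequency within the input document
    df = Counter()                   # document frequency across the corpus
    for doc in corpus:
        df.update(set(doc))
    scored = []
    for term, freq in tf.items():
        if freq < min_term_freq or df[term] < min_doc_freq:
            continue                 # mirrors min_term_freq / min_doc_freq filtering
        idf = math.log(n_docs / (1 + df[term])) + 1
        scored.append((freq * idf, term))
    return [term for _, term in sorted(scored, reverse=True)[:k]]

# The selected terms are then OR-ed into one disjunctive query:
terms = top_k_terms(
    ["11", "12", "13", "14", "15"],        # input page
    [["11", "12", "13", "14", "105"],      # corpus page 1
     ["21", "22", "23", "24", "205"]],     # corpus page 2
)
disjunctive_query = {
    "bool": {"should": [{"term": {"pages.words": t}} for t in terms]}
}
```

Note how `"15"` is dropped: it appears in no corpus document, so it fails the `min_doc_freq` filter.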
More accurately, from this link:
> The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.
Let's see how we can achieve some use-cases that you've mentioned.
Use Case 1: Find documents of a page having min_word_match_score 2.
Note that your field `pages` would need to be of the `nested` type; with the `object` type this scenario wouldn't be possible. I suggest you go through the aforementioned links to learn more about this.
Let's say I have two indexes
- my_book_index - This would have the documents to be searched on
- my_book_index_input - This would have the documents used as input documents
Both would have the mapping structure as below:

```
{
  "mappings": {
    "properties": {
      "book_id": {
        "type": "keyword"
      },
      "pages": {
        "type": "nested"
      }
    }
  }
}
```
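For completeness, here is how that mapping could be applied when creating the indices (Kibana Dev Tools syntax; repeat the same body for `my_book_index_input`):

```
PUT my_book_index
{
  "mappings": {
    "properties": {
      "book_id": { "type": "keyword" },
      "pages":   { "type": "nested" }
    }
  }
}
```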
Sample Documents for my_book_index:
```
POST my_book_index/_doc/1
{
  "book_id": "book01",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "14", "105"] },
    { "page_id": 2, "words": ["21", "22", "23", "24", "205"] },
    { "page_id": 3, "words": ["31", "32", "33", "34", "305"] },
    { "page_id": 4, "words": ["41", "42", "43", "44", "405"] }
  ]
}

POST my_book_index/_doc/2
{
  "book_id": "book02",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "104", "105"] },
    { "page_id": 2, "words": ["21", "22", "23", "204", "205"] },
    { "page_id": 3, "words": ["301", "302", "303", "304", "305"] },
    { "page_id": 4, "words": ["401", "402", "403", "404", "405"] }
  ]
}

POST my_book_index/_doc/3
{
  "book_id": "book03",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "100", "105"] },
    { "page_id": 2, "words": ["21", "22", "23", "200", "205"] },
    { "page_id": 3, "words": ["301", "302", "303", "300", "305"] },
    { "page_id": 4, "words": ["401", "402", "403", "400", "405"] }
  ]
}
```
Sample Document for my_book_index_input:
```
POST my_book_index_input/_doc/1
{
  "book_id": "book_new",
  "pages": [
    { "page_id": 1, "words": ["11", "12", "13", "14", "15"] },
    { "page_id": 2, "words": ["21", "22", "23", "24", "25"] }
  ]
}
```
More Like This Query:
Use case: basically, I am interested in finding documents which would be similar to the above input document, having 4 matches in page 1 or 4 matches in page 2.
```
POST my_book_index/_search
{
  "size": 10,
  "_source": "book_id",
  "query": {
    "nested": {
      "path": "pages",
      "query": {
        "more_like_this": {
          "fields": ["pages.words"],
          "like": [
            {
              "_index": "my_book_index_input",
              "_id": 1
            }
          ],
          "min_term_freq": 1,
          "min_doc_freq": 1,
          "max_query_terms": 25,
          "minimum_should_match": 4
        }
      },
      "inner_hits": {
        "_source": ["pages.page_id", "pages.words"]
      }
    }
  }
}
```
Basically, I want to search in `my_book_index` for all the documents that are similar to `_doc: 1` in the index `my_book_index_input`.
Notice each and every parameter in the query; I'd suggest you go through it line by line to understand them all.
Note the response below when you execute that query:
Response:
```
{
  "took" : 71,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.096043,
    "hits" : [
      {
        "_index" : "my_book_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.096043,
        "_source" : {
          "book_id" : "book01"            <---- Document 1 returned
        },
        "inner_hits" : {
          "pages" : {
            "hits" : {
              "total" : {
                "value" : 2,              <---- Number of pages hit for this document
                "relation" : "eq"
              },
              "max_score" : 6.096043,
              "hits" : [
                {
                  "_index" : "my_book_index",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "pages",
                    "offset" : 0
                  },
                  "_score" : 6.096043,
                  "_source" : {
                    "page_id" : 1,        <---- Page 1 returned as it has 4 matches
                    "words" : [ "11", "12", "13", "14", "105" ]
                  }
                },
                {
                  "_index" : "my_book_index",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "pages",
                    "offset" : 1
                  },
                  "_score" : 6.096043,
                  "_source" : {
                    "page_id" : 2,        <---- Page 2 returned as it also has 4 matches
                    "words" : [ "21", "22", "23", "24", "205" ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
```
Note that only the document with book_id: book01 was returned. The reason is simple. I've mentioned the below properties in the query:

```
"min_term_freq": 1,
"min_doc_freq": 1,
"max_query_terms": 25,
"minimum_should_match": 4
```

Basically, only terms from the input document whose term frequency is at least 1 and which appear in at least 1 document are considered for the search, and the number of matching terms in a single nested document must be at least 4.
Change the parameters, e.g. set `min_doc_freq` to `3` and `minimum_should_match` to `3`, and you should see a few more documents.
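For example, the relaxed query would keep everything else the same and only change those two values (values here are purely for experimentation):

```
POST my_book_index/_search
{
  "query": {
    "nested": {
      "path": "pages",
      "query": {
        "more_like_this": {
          "fields": ["pages.words"],
          "like": [ { "_index": "my_book_index_input", "_id": 1 } ],
          "min_term_freq": 1,
          "min_doc_freq": 3,
          "max_query_terms": 25,
          "minimum_should_match": 3
        }
      }
    }
  }
}
```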
Notice that you may not see every document fulfilling the above properties; that is because of the way MLT has been implemented. Remember the steps I've mentioned at the beginning: only the top K terms by tf-idf from the input document are used to build the query, so some expected matches can be missed.
Use Case 2: Use Case 1 + return only those with a minimum page match of 2
I'm not sure if this is supported out of the box, i.e. adding a filter on inner_hits based on the count of inner_hits; however, I believe this is something you can add at your application layer. Basically, take the above response, compute inner_hits.pages.hits.total.value, and return only the qualifying documents to the consumer. Below is how your request/response flow would be:
For Request: Client Layer (UI) ---> Service Layer --> Elasticsearch
For Response: Elasticsearch ---> Service Layer (filter logic for n pages match) --> Client Layer (or UI)
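A sketch of that service-layer filter in plain Python, working on the parsed JSON response (the function name and the `min_page_match` parameter are my own; the response shape follows the example above):

```python
def filter_by_min_page_match(es_response, min_page_match=2):
    # Keep only hits whose number of matching nested pages
    # (inner_hits.pages.hits.total.value) is at least min_page_match.
    results = []
    for hit in es_response["hits"]["hits"]:
        page_matches = hit["inner_hits"]["pages"]["hits"]["total"]["value"]
        if page_matches >= min_page_match:
            results.append({"book_id": hit["_source"]["book_id"],
                            "page_matches": page_matches})
    return results

# Example with a trimmed-down response: book01 matched 2 pages, book02 only 1.
response = {"hits": {"hits": [
    {"_source": {"book_id": "book01"},
     "inner_hits": {"pages": {"hits": {"total": {"value": 2}}}}},
    {"_source": {"book_id": "book02"},
     "inner_hits": {"pages": {"hits": {"total": {"value": 1}}}}},
]}}
filtered = filter_by_min_page_match(response)  # only book01 survives
```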
This may not be the best solution, and at times it may give you results that are not exactly what you expect, but I'd suggest at least giving it a try, as the only alternative to this query is, sadly, to write your own custom client code making use of the Term Vectors API as mentioned in this link.
Remember the algorithm of how the MLT query works, and see if you can dig deeper into why the results return the way they do.
I hope this helps!