
I'm trying to wrap my mind around how the more like this query works, and I seem to be missing something. I read the documentation, but the ES documentation is often somewhat...lacking.

The goal is to be able to limit results by term frequency, as attempted here.

So I set up a simple index, including term vectors for debugging, then added two simple docs.

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "doc": {
         "properties": {
            "text": {
               "type": "string",
               "term_vector": "yes"
            }
         }
      }
   }
}

PUT /test_index/doc/1
{
    "text": "apple, apple, apple, apple, apple"
}

PUT /test_index/doc/2
{
    "text": "apple, apple"
}

When I look at the termvectors I see what I expect:

GET /test_index/doc/1/_termvector
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 5
            }
         }
      }
   }
}

GET /test_index/doc/2/_termvector
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "2",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 2
            }
         }
      }
   }
}

When I run the following query with "min_term_freq": 1 I get back both docs:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 1,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.5816214,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.5816214,
            "_source": {
               "text": "apple, apple, apple, apple, apple"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.5254995,
            "_source": {
               "text": "apple, apple"
            }
         }
      ]
   }
}

But if I increase "min_term_freq" to 2 (or more) I get nothing, though I would expect both documents to be returned:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

Why? What am I missing?

If I want to set up a query that would only return the document in which "apple" appears 5 times, but not the one in which it appears 2 times, is there a better way?

Here is the code, for convenience:

http://sense.qbox.io/gist/341f9f77a6bd081debdcaa9e367f5a39be9359cc

Sloan Ahrens

2 Answers


min_term_freq and min_doc_freq are applied to the input text before the MLT query is built, not to the documents being compared. Since "apple" occurs only once in your input text, it never qualifies as a query term when min_term_freq is set to 2. If you change your input to "apple apple" as below, the query works:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}

The same goes for min_doc_freq. "apple" is found in 2 documents, so any min_doc_freq up to 2 lets "apple" from the input text qualify for the MLT query.
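To make the input-side filtering concrete, here is a minimal Python sketch of the idea. It is a simplified model, not Elasticsearch's actual implementation: the function name `qualifying_terms` is hypothetical, and a lowercase word split stands in for the real analyzer.

```python
import re
from collections import Counter

def qualifying_terms(like_text, min_term_freq):
    """Return the terms of like_text that occur at least min_term_freq times.

    Assumption: lowercase word tokens approximate what the analyzer produces.
    """
    tokens = re.findall(r"\w+", like_text.lower())
    counts = Counter(tokens)
    return sorted(term for term, tf in counts.items() if tf >= min_term_freq)

# With a single "apple" in the input, no term survives min_term_freq=2,
# so there is nothing left to search for.
print(qualifying_terms("apple", 2))
# With "apple apple", the term occurs twice in the input and is kept.
print(qualifying_terms("apple apple", 2))
```

Note that the frequency check only ever looks at the input text; the term counts inside the indexed documents play no part in it.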

Vineeth Mohan
  • Thanks, Vineeth. That works, though I still don't understand why. If I search for `{... "like_text": "apple apple apple", "min_term_freq": 3,...}` I still get both results, even though "apple" occurs less than 3 times in one of the documents. So how can I limit the results to the ones in which the term occurs at or above the minimum frequency? – Sloan Ahrens Feb 04 '15 at 02:35
  • I don't think you can use MLT for that. Both the min term frequency and min doc frequency constraints are applied to the input text rather than to the documents being compared. Another way would be to use scripting to do this in a script filter - http://stackoverflow.com/questions/28296320/elasticsearch-filter-via-number-of-mentions/28312561#28312561 – Vineeth Mohan Feb 04 '15 at 02:39
  • I think mlt query doesn't support "percent_terms_to_match", at least it doesn't work for ES 2.2 – isaranchuk Apr 28 '16 at 09:10
  • Will MLT work if the value of the property is not text but an array of numbers? If not, is there something that would work for this? I need to use the tags of a doc to retrieve other docs that have the largest number of matching tags (numbers) – George Cscnt Mar 06 '20 at 01:04
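For the follow-up in the comments above (keeping only documents where the term occurs at least N times), one option outside MLT is to check the document-side frequency yourself. The Python sketch below is only an illustration under assumptions: it filters on the client, it takes response bodies shaped like the `_termvector` output shown in the question, and the helper name `term_freq` is hypothetical. Fetching the responses is left out, and this would not scale the way a server-side script filter does.

```python
def term_freq(termvector_response, field, term):
    """Read term_freq for one term from a _termvector response body.

    Returns 0 when the field or the term is missing.
    """
    terms = (termvector_response.get("term_vectors", {})
             .get(field, {})
             .get("terms", {}))
    return terms.get(term, {}).get("term_freq", 0)

# Trimmed copies of the two responses from the question.
doc1 = {"_id": "1", "term_vectors": {"text": {"terms": {"apple": {"term_freq": 5}}}}}
doc2 = {"_id": "2", "term_vectors": {"text": {"terms": {"apple": {"term_freq": 2}}}}}

# Keep only the documents where "apple" occurs at least 5 times.
matching_ids = [d["_id"] for d in (doc1, doc2)
                if term_freq(d, "text", "apple") >= 5]
print(matching_ids)
```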

As the poster of this question, I was trying to wrap my mind around the more_like_this query, too...

I struggled a bit to find good sources of information on the web, but (as in most cases) the documentation seems to help the most. So here are some of the more important parameters from the documentation; where one was a bit harder to understand, I added my interpretation:

max_query_terms - The maximum number of query terms that will be selected (from each input document). Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25.

min_term_freq - The minimum term frequency below which the terms will be ignored from the input document. Defaults to 2.

If a term appears in the input document fewer than 2 times (the default), it is ignored, i.e. it is not searched for in the candidate more_like_this documents.

min_doc_freq - The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.

This one took me a second to get, so, here's my interpretation:

The number of documents in which a term from the input document must appear in order to be selected as a query term.
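The three parameters above can be put together in a small Python sketch. It is a rough model under assumptions, not the real implementation: `select_query_terms` is a hypothetical name, the frequencies are passed in as plain dicts, and terms are ranked simply by input frequency, whereas the actual scoring is more involved. The defaults mirror the ones quoted above.

```python
def select_query_terms(input_tf, doc_freq,
                       min_term_freq=2, min_doc_freq=5, max_query_terms=25):
    """Pick query terms from the input document.

    input_tf: term -> frequency in the input document
    doc_freq: term -> number of indexed documents containing the term
    """
    kept = [term for term, tf in input_tf.items()
            if tf >= min_term_freq and doc_freq.get(term, 0) >= min_doc_freq]
    # Assumption: rank by input frequency (ties broken alphabetically).
    kept.sort(key=lambda term: (-input_tf[term], term))
    return kept[:max_query_terms]

# Input "apple apple" against the two-document test index from the question.
input_tf = {"apple": 2}
doc_freq = {"apple": 2}

# With the defaults, min_doc_freq=5 drops "apple": it is in only 2 documents.
with_defaults = select_query_terms(input_tf, doc_freq)
# Lowering min_doc_freq to 1, as in the question, keeps it.
relaxed = select_query_terms(input_tf, doc_freq, min_doc_freq=1)
print(with_defaults, relaxed)
```

On a tiny test index the default min_doc_freq is easy to trip over, which is why the queries in the question set it to 1 explicitly.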

There it is, I hope I saved someone a few minutes of his life. :)

Cheers!

Filip Savic