ElasticSearch: minimum_should_match and length of terms list

Question

Using ElasticSearch I'm trying to use the minimum_should_match option on a Terms Query to find documents that have a list of longs that is X% similar to the list of longs I'm querying with.

e.g:

{
    "filter": {
        "fquery": {
            "query": {
                "terms": {
                    "mynum": [1, 2, 3, 4, 5, 6, 7, 8, 9, 13],
                    "minimum_should_match": "90%",
                    "disable_coord": false
                }
            }
        }
    }
}

will match two documents with a mynum list of:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

and:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]

This works and is correct: the first document has a 10 where the query has a 13, and the second document has an 11 where the query again has a 13.

That means 1 out of the 10 numbers in my query's list is missing from each returned document, which is within the 90% similarity allowed by the minimum_should_match value in the query.
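To make the matching rule concrete, here is a minimal Python sketch of how I understand a percentage minimum_should_match to be applied. It is plain Python, not Elasticsearch code; the function name `matches` is made up for illustration, and rounding the required count down is an assumption based on my reading of the documentation.

```python
def matches(query_terms, doc_terms, percent):
    """Return True if enough of the query terms occur in the document."""
    # Assumption: the percentage is turned into a required count, rounded down.
    required = len(query_terms) * percent // 100
    # Only the query terms are counted; extra document values are ignored.
    found = len(set(query_terms) & set(doc_terms))
    return found >= required


query = [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
doc_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
doc_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]

print(matches(query, doc_a, 90))  # True: 9 of 10 query terms found
print(matches(query, doc_b, 90))  # True: 9 of 10 query terms found
```

Note that the length of the document's list never enters the calculation, only the length of the query's list.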

The issue is that I would like different behaviour: since the second document is longer (11 numbers instead of 10), its difference should ideally be higher, because it actually has two values, 11 and 12, that are not in the query's list. For example:

Instead of computing the intersection of:

(list1) [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]

with:

(list2) [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]

which is a 10% difference

it should say that since list2 is longer than list1, the intersection should be:

(list2) [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]

with:

(list1) [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]

which is roughly an 18% difference (2 of the 11 values in list2 are not in list1)
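The two directions can be checked with plain set arithmetic. This is only a sketch of the measure I am after, not something Elasticsearch provides; the helper name `difference` is hypothetical.

```python
def difference(base, other):
    """Fraction of the values in `base` that do not appear in `other`."""
    base_set, other_set = set(base), set(other)
    return len(base_set - other_set) / len(base_set)


list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]      # the query
list2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]  # the longer document

# Query as the base: only 13 is missing from the document (1 of 10).
print(f"{difference(list1, list2):.1%}")  # 10.0%

# Document as the base: 11 and 12 are missing from the query (2 of 11).
print(f"{difference(list2, list1):.1%}")  # 18.2%
```

Using the longer list as the base is what makes the extra values in the document count against the match.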

Is this possible?
If not, how could I factor in the length of the list, other than by using a dense vector rather than a sparse one? For example:

using

[1, 2, 3, 4, 5, 6, 7, 8, 9, , , , 13]

rather than:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 13]

A possible workaround is to map mynum as a not_analyzed string type and enable [norms](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#norms). I haven't tried it, but it's probably worth a go. - keety, Jul 09 '15 at 03:09

0 Answers