I have a set of words extracted from text through NLP algorithms, with an associated score for each word in every document.

For example :

document 1: {  "vocab": [ {"wtag":"James Bond", "rscore": 2.14 }, 
                          {"wtag":"world", "rscore": 0.86 }, 
                          ...., 
                          {"wtag":"somemore", "rscore": 3.15 }
                        ] 
            }

document 2: {  "vocab": [ {"wtag":"hiii", "rscore": 1.34 }, 
                          {"wtag":"world", "rscore": 0.94 },
                          ...., 
                          {"wtag":"somemore", "rscore": 3.23 } 
                        ] 
            }

I want rscores of matched wtag in each document to affect the _score given to it by ES, maybe multiplied or added to the _score, to influence the final _score (in turn, order) of the resulting documents. Is there any way to achieve this?

Haywire

4 Answers

Another way of approaching this would be to use nested documents:

First set up the mapping to make vocab a nested field, meaning that each wtag/rscore object is indexed internally as a separate document:

curl -XPUT "http://localhost:9200/myindex/" -d'
{
  "settings": {"number_of_shards": 1}, 
  "mappings": {
    "mytype": {
      "properties": {
        "vocab": {
          "type": "nested",
          "properties": {
            "wtag": {
              "type": "string"
            },
            "rscore": {
              "type": "float"
            }
          }
        }
      }
    }
  }
}'

Then index your docs:

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
  "vocab": [
    {
      "wtag": "James Bond",
      "rscore": 2.14
    },
    {
      "wtag": "world",
      "rscore": 0.86
    },
    {
      "wtag": "somemore",
      "rscore": 3.15
    }
  ]
}'

curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
  "vocab": [
    {
      "wtag": "hiii",
      "rscore": 1.34
    },
    {
      "wtag": "world",
      "rscore": 0.94
    },
    {
      "wtag": "somemore",
      "rscore": 3.23
    }
  ]
}'

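Then run a nested query that matches the nested documents and sums the score of each one that matches. Note that function_score multiplies the query's _score by the script result by default, so each matching nested document contributes _score × rscore; add "boost_mode": "replace" to the function_score if you want the plain sum of the rscore values: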

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
  "query": {
    "nested": {
      "path": "vocab",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "vocab.wtag": "james bond world"
            }
          },
          "script_score": {
            "script": "doc[\"rscore\"].value"
          }
        }
      }
    }
  }
}'
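
To make the arithmetic of the query above concrete, here is a minimal Python sketch of how the per-nested-document scores would be combined under "score_mode": "sum". It does not talk to Elasticsearch; the function name nested_sum_score and the query scores in the example are made up for illustration, and only the rscore values come from the question.

```python
def nested_sum_score(matches, boost_mode="multiply"):
    """Combine nested-document scores the way score_mode "sum" would.

    matches: list of (query_score, rscore) pairs, one per matching nested doc.
    boost_mode: "multiply" (the function_score default) or "replace".
    """
    total = 0.0
    for query_score, rscore in matches:
        if boost_mode == "multiply":
            # Default: the script result is multiplied by the query's _score.
            total += query_score * rscore
        elif boost_mode == "replace":
            # The script result replaces the query's _score entirely.
            total += rscore
        else:
            raise ValueError("unsupported boost_mode: " + boost_mode)
    return total


# Hypothetical query scores (0.5 and 0.3) paired with rscores from document 1.
doc1_matches = [(0.5, 2.14), (0.3, 0.86)]
print(nested_sum_score(doc1_matches))             # sum of query_score * rscore
print(nested_sum_score(doc1_matches, "replace"))  # plain sum of rscore
```

With the default mode the result depends on the relevance score of each match, which is why the final _score is generally not just the sum of the matching rscore values.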
DrTech
  • @haywire This approach is probably more scalable and easier to implement than the approach in my other answer – DrTech Feb 01 '14 at 14:11
  • Hi guys, I have just edited the answer above to show the output of the query and the specific scores associated to the 2 existing documents. I was surprised to see that despite the user defined scoring function being a simple sum, the resulting document score is NOT the sum of the matching wtag's rscore. In the query 'world' I would have expected document 1 to get score 0.86 and document 2 be scored 0.94 ... why is that so ? – Samuel Kerrien Jul 10 '14 at 15:13
  • To make this answer work on newer versions of ES (which have scripting disabled by default), and also to make it faster, replace "script_score" : {...} with "field_value_factor" : { "field" : "rscore" } – yahermann Nov 22 '14 at 19:37
  • I'm using this recipe with field_value_factor (as described in my previous comment), but for some reason the calculated _score gets completely messed up when rscore=1. Mapping confirms rscore="double". Changing rscore=0.99 and any other number but 1 seems to work just fine. Bug? – yahermann Mar 15 '15 at 18:32
  • @DrTech you advocate for nested documents but what if i have 3000 word, value pairs in each document, and I have >100,000,000 documents. I'd hit the Lucene per shard document count limit quickly (for every one document, I'd really have 3001 documents due to the 3000 nested docs). In that case, do you recommend your first answer using payloads? – Darby May 22 '15 at 06:47

Have a look at the delimited payload token filter which you can use to store the scores as payloads, and at text scoring in scripts which gives you access to the payloads.

UPDATED TO INCLUDE EXAMPLE

First you need to set up an analyzer which will take the number after | and store that value as a payload with each token:

curl -XPUT "http://localhost:9200/myindex/" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "payloads": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "delimited_payload_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "payloads",
          "term_vector": "with_positions_offsets_payloads"
        }
      }
    }
  }
}'

Then index your document:

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
  "text": "James|2.14 Bond|2.14 world|0.86 somemore|3.15"
}'
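
If your data is already in the vocab format from the question, the text field could be generated rather than written by hand. The following is a minimal sketch, assuming every word of a multi-word wtag should carry the same rscore (as in the example above) and that no token contains the | delimiter; the helper name to_payload_text is hypothetical.

```python
def to_payload_text(vocab, delimiter="|"):
    """Turn [{"wtag": ..., "rscore": ...}, ...] into "token|score ..." text."""
    parts = []
    for entry in vocab:
        # The whitespace tokenizer splits on spaces, so attach the score
        # to each word of a multi-word tag separately.
        for token in entry["wtag"].split():
            parts.append(f"{token}{delimiter}{entry['rscore']}")
    return " ".join(parts)


vocab = [
    {"wtag": "James Bond", "rscore": 2.14},
    {"wtag": "world", "rscore": 0.86},
    {"wtag": "somemore", "rscore": 3.15},
]
print(to_payload_text(vocab))
# prints: James|2.14 Bond|2.14 world|0.86 somemore|3.15
```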

And finally, search with a function_score query that iterates over each term, retrieves the payload and incorporates it with the _score:

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "text": "james bond"
        }
      },
      "script_score": {
        "script": "score=0; for (term: my_terms) { termInfo = _index[\"text\"].get(term,_PAYLOADS ); for (pos : termInfo) { score = score +  pos.payloadAsFloat(0);} } return score;",
        "params": {
          "my_terms": [
            "james",
            "bond"
          ]
        }
      }
    }
  }
}'

The script itself, when not compressed into one line, looks like this:

score=0; 
for (term: my_terms) { 
    termInfo = _index['text'].get(term,_PAYLOADS ); 
    for (pos : termInfo) { 
        score = score +  pos.payloadAsFloat(0);
    } 
} 
return score;
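
For readers unfamiliar with the scripting syntax, the same logic can be sketched in plain Python against an in-memory stand-in for one document's term data. This is only an illustration of what the script computes, not Elasticsearch code; the dictionary layout and the name payload_score are assumptions made for the example.

```python
def payload_score(term_payloads, my_terms, default=0.0):
    """Sum the payloads of every occurrence of every requested term.

    term_payloads: maps a term to a list with one payload per position
                   at which the term occurs (None if a position has no payload).
    my_terms: the terms passed to the script as a parameter.
    default: value used for a missing payload, like payloadAsFloat(0).
    """
    score = 0.0
    for term in my_terms:
        # A term that does not occur in the document contributes nothing.
        for payload in term_payloads.get(term, []):
            score += default if payload is None else payload
    return score


# Stand-in for document 1 after analysis (tokens are lowercased).
doc1 = {"james": [2.14], "bond": [2.14], "world": [0.86], "somemore": [3.15]}
print(payload_score(doc1, ["james", "bond"]))
```

Because the terms are passed in as a parameter, they do not have to be the same terms used in the match query.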

Warning: accessing payloads has a significant performance cost, and running scripts also has a performance cost. You may want to experiment with it using dynamic scripts as above, then rewrite the script as a native Java script when you're satisfied with the result.

DrTech
  • Can you please explain how do I go about using delimited payload token filter and text scoring? I could not get my head around it in the context of the question. Even though I am choosing moliware's answer (as it was simple to understand), I might actually be missing something vital! Thanks for your pointers anyways :-) – Haywire Feb 01 '14 at 12:20
  • Added a full example. – DrTech Feb 01 '14 at 13:59
  • Why did you have to repeat the terms in the script score? Doesn't the script score have access to the query terms ? – itaifrenkel May 27 '14 at 05:58
  • No, it is completely separate. And you may want to pass another set of terms there anyway. – DrTech May 29 '14 at 21:41
  • @DrTech Can you further qualify the performance cost for accessing payloads? It's an extra 8bytes (or is it 4) for every token in every document it's associated with. I presume this has to be held in memory to be efficient? And presuming it can be, the penalty is the extra lookup in the positions (memmapped) file? - oh actually, I would have to save the position of the tokens too - so that's another 4bytes-ish? 12-total? -- great book BTW – JnBrymn Jul 30 '15 at 04:18
  • Thanks for the full example. I didn't find it that difficult to extend to using rescore, so that the script only runs on the top 100 results. – Greg Lindahl Mar 18 '16 at 21:45

I think that script_score function is what you need (doc).

Function score queries were introduced in 0.90.4; if you are using an older version, check custom score queries.

moliware
  • Thanks a lot! I am editing your solution to include the details and marking it as the answer so that others may find it without any hassle – Haywire Feb 01 '14 at 11:59
  • @haywire your edition was refused so I updated it for having the full example which may be useful for others. – moliware Feb 01 '14 at 18:07
  • This solution won't work correctly because you are not using nested fields. Your array of objects is collapsed into two multi value fields with no correlation between wtag and rscore. – DrTech Feb 02 '14 at 09:12
  • @DrTech Yess! I realised something is not right but couldn't figure it out for my life. Thanks! – Haywire Feb 03 '14 at 03:35