
Test data:

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '{ "body": "this is a test" }'
curl -XPUT 'localhost:9200/customer/external/2?pretty' -d '{ "body": "and this is another test" }'
curl -XPUT 'localhost:9200/customer/external/3?pretty' -d '{ "body": "this thing is a test" }'

My goal is to get the frequency of a phrase in a document.

I know how to get the frequency of the terms in a document:

curl -g "http://localhost:9200/customer/external/1/_termvectors?pretty" -d'
{
        "fields": ["body"],
        "term_statistics" : true
}'

And I know how to count the documents that contain a given phrase (with a match_phrase or span_near query):

curl -g "http://localhost:9200/customer/_count?pretty" -d'
{
  "query": {
    "match_phrase": {
      "body" : "this is"
      }
    }    
}'

How can I access the frequency of a phrase?

  • It sounds like it's not really possible, at least at the ES level, based on this discussion: https://discuss.elastic.co/t/phrase-frequency-in-a-document-and-in-the-whole-collection/61616/3 – chilladx Oct 04 '17 at 15:57
  • I found this discussion, but my understanding is that there is no way to get "the sum of the phrase freqs for all documents", which is not really what I am after. I want the phrase freq for one document. Am I misinterpreting? – Gilles Cuyaubere Oct 04 '17 at 16:01
  • "we need these stats to develop our own scoring model" this makes me think it's a per document stat, computed during the request. – chilladx Oct 04 '17 at 16:08
  • Yes, I am looking for a per document stat. Any idea on how I could get it? – Gilles Cuyaubere Oct 04 '17 at 16:20
  • How large are those phrases? If they have a bounded length, you could use shingles and generate all combinations of the N-grams at indexing time. Then you could look up the frequencies of those tokens. – drjz Oct 04 '17 at 16:38
  • There is no fixed number of words, but I can generate shingles up to the maximum phrase length and then apply a keep words token filter with the list of phrases I want to match against. – Gilles Cuyaubere Oct 05 '17 at 09:37
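The shingle idea from the two comments above can be illustrated outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code: `shingles` and `phrase_frequencies` are hypothetical helper names, whitespace splitting is a crude stand-in for a real analyzer, the `wanted` set plays the role of the keep words list, and the sample sentence is made up for illustration.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=3):
    """Yield every run of min_size..max_size consecutive tokens, joined by a space."""
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


def phrase_frequencies(text, phrases, max_size=3):
    """Count how often each phrase of interest occurs as a shingle of `text`."""
    tokens = text.lower().split()  # assumption: stand-in for the index analyzer
    wanted = set(phrases)          # assumption: stand-in for the keep words list
    return Counter(s for s in shingles(tokens, 2, max_size) if s in wanted)


# Hypothetical sample text, not one of the indexed test documents.
counts = phrase_frequencies(
    "this is a test and this is another test",
    ["this is", "is a test"],
)
print(counts["this is"], counts["is a test"])  # prints "2 1"
```

In Elasticsearch itself the same effect would come from a custom index analyzer with a shingle token filter, so that each phrase becomes a single term whose frequency a termvectors request can report.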

1 Answer

You can use termvectors. As the documentation says:

Return values

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Term information

term frequency in the field (always returned)
term positions (positions : true)
start and end offsets (offsets : true)
term payloads (payloads : true), as base64 encoded bytes

What you need is the term frequency. In the documentation's example you can see the frequency of "john doe" in the document. Note that storing term vectors doubles the disk space used by the field they are applied to.
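If the field is indexed as single words rather than as phrase tokens, the term positions listed above are enough to count a phrase on the client side. The answer does not spell this out, so treat the following Python sketch as one possible reading: `phrase_freq` is a hypothetical helper, and `sample` is a hand-built, trimmed imitation of a `_termvectors` response for the document "this is a test" (real responses carry more keys, such as offsets and field statistics).

```python
def phrase_freq(term_vectors, field, phrase):
    """Count occurrences of `phrase` in one document's term-vector response."""
    terms = term_vectors["term_vectors"][field]["terms"]
    words = phrase.lower().split()  # assumption: matches how the field was tokenized
    # Positions at which each word of the phrase occurs; unknown words give an empty set.
    positions = [
        {tok["position"] for tok in terms.get(word, {}).get("tokens", [])}
        for word in words
    ]
    if not positions:
        return 0
    # The phrase starts at p when word i sits at position p + i for every i.
    return sum(
        all(start + i in positions[i] for i in range(len(words)))
        for start in positions[0]
    )


# Hand-built sample shaped like a termvectors response (illustrative, not real output).
sample = {
    "term_vectors": {
        "body": {
            "terms": {
                "this": {"term_freq": 1, "tokens": [{"position": 0}]},
                "is": {"term_freq": 1, "tokens": [{"position": 1}]},
                "a": {"term_freq": 1, "tokens": [{"position": 2}]},
                "test": {"term_freq": 1, "tokens": [{"position": 3}]},
            }
        }
    }
}

print(phrase_freq(sample, "body", "this is"))  # prints 1
print(phrase_freq(sample, "body", "is this"))  # prints 0
```

This costs one termvectors request per document, so it suits a per-document statistic better than a collection-wide one.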

Lupanoide
  • Yes but I need my phrase to be considered a token to get its frequency. In the example you mention the "keyword" analyzer is used. Following @drjz's comment I will try to implement a custom analyzer (with shingles) before using termvectors. – Gilles Cuyaubere Oct 05 '17 at 09:13
  • @Gilles Cuyaubere No, absolutely not. Term vectors work only with text fields, not with keyword ones. As I suggested yesterday, look at this example: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/docs-termvectors.html#_example_returning_stored_term_vectors and at the per_field_analyzer query with john doe. – Lupanoide Oct 05 '17 at 12:31
  • In the per_field analyzer example, the chosen analyzer is the keyword analyzer https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-keyword-analyzer.html. So the frequency of "John Doe" token in the "John Doe" document is 1. If we come back to my example, this would give me the freq of "this is a test" in "this is a test" when I am actually looking for the freq of "this is" in "this is a test". – Gilles Cuyaubere Oct 05 '17 at 12:58
  • No, you don't understand the trick. If you read the example carefully, you will see the mapping of the field fullname: it's not keyword, it's text! But if you want statistics about a sequence of words, as in your case, you have to query without splitting them on spaces, so you use the keyword tokenizer as a search analyzer (that is not the tokenizer of the field, as you said, and it is not the index analyzer). So you generate shingles only in the index analyzer, and query for the frequency of a sequence of words as a single token with the search analyzer. Pay attention! – Lupanoide Oct 05 '17 at 13:47
  • If you query "this is" at the termvectors endpoint without using keyword as the search analyzer, the output will contain statistics for "this" and statistics for "is". – Lupanoide Oct 05 '17 at 13:57