I think this is a straightforward application, yet I cannot find a recipe on the internet.

Can you suggest a JSON query to send from Python to an Elasticsearch instance that would return the frequency of a specific term in a certain field?

I guess it should be possible with some tweak of the Term Vectors API, but it does not seem straightforward.

I would like to get both the absolute frequency and the number of documents containing the term.

Radio Controlled
  • So there is no direct way? I have to get the docids first and then either count them or aggregate the tf over all of them for each term? – Radio Controlled Feb 11 '20 at 14:22

2 Answers

1

If you have the document ids, you can use the Multi termvectors API: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-multi-termvectors.html

curl -X POST "localhost:9200/index/type/_mtermvectors?pretty" -H 'Content-Type: application/json' -d' 
{
    "ids" : ["your_document_id1","your_document_id2"],      
    "parameters": {
        "fields": [
                "your_field"       
        ],
        "term_statistics": true
    }
}
'

You can even pass an artificial document with the terms you want to analyse. As stated here (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html), make sure term_statistics is set to true, so that you get this information across your index:

  • total term frequency (how often a term occurs in all documents)
  • document frequency (the number of documents containing the current term)
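Since the question asks for Python, here is a minimal sketch of how the request body for the artificial-document variant could be built. The helper name `build_artificial_mtermvectors_body` and the field name `your_field` are placeholders of mine, not part of any library; the body follows the `docs` / `doc` layout described on the linked Multi termvectors page, and it assumes the index is given in the URL or in the client call rather than per document.

```python
import json


def build_artificial_mtermvectors_body(field, terms):
    """Build a _mtermvectors body with one artificial document per term.

    Hypothetical helper: each artificial document contains only the term
    we want statistics for, in the field we are interested in.
    """
    return {
        "docs": [
            {
                "doc": {field: term},      # artificial document, not stored in the index
                "fields": [field],
                "term_statistics": True,   # request doc_freq and ttf for each term
                "positions": False,
                "offsets": False,
            }
            for term in terms
        ]
    }


body = build_artificial_mtermvectors_body("your_field", ["foo", "bar"])
print(json.dumps(body, indent=2))
```

The resulting dictionary should be usable as the `body` of an `_mtermvectors` request against your index, in the same way as the curl call above.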
Fran García
  • Thanks, could you complete this answer by specifying what "the ids" means (document ids or term ids)? If it refers to term ids, how do I retrieve them given the actual term string? And if it refers to document ids, how do I get all document ids or (for efficiency) the document ids for the respective term (I guess this is in the link, too)? – Radio Controlled Feb 12 '20 at 09:00
  • I meant document ids (updated the answer). I don't think you can get the term frequencies for all your indexed documents, but it seems you can pass an artificial document to the Multi termvectors API with the terms you want to check. This, in combination with setting term_statistics to true, is I think the closest solution you can get to what you need. – Fran García Feb 12 '20 at 09:05
0

There is actually a simple solution. It goes like this:

from copy import deepcopy
import sys

from elasticsearch import Elasticsearch as ES

_field = sys.argv[1]   # field to inspect
_terms = sys.argv[2:]  # one or more terms to look up

_timeout = 60
_gate    = 'some.gate.org/'
_index   = 'some_index'
_client  = ES([_gate],scheme='http',port=80,timeout=_timeout) #or however to get connection

# Request template: an artificial document whose only content is the term
_body = {"doc": {_field: None}, "term_statistics": True, "field_statistics": True, "positions": False, "offsets": False}

for term in _terms:
    body = deepcopy(_body)
    body["doc"][_field] = term
    result = _client.termvectors(index=_index, body=body)
    stats  = result['term_vectors'][_field]['terms'][term]
    print('documents with', term, ':', stats['doc_freq'])
    print('frequency of  ', term, ':', stats['ttf'])
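One caveat: indexing the response directly with the raw term can raise a KeyError if the field's analyzer changes the token (for example by lowercasing it), because the keys under 'terms' are the analyzed tokens. A more defensive way to read the response might look like the sketch below. The function name `extract_term_stats` is hypothetical, and the sample response is hand-written in the shape used above, with made-up numbers.

```python
def extract_term_stats(response, field):
    """Map each term in a termvectors response to its doc_freq and ttf.

    Uses .get() throughout, so a missing field or missing statistic
    yields an empty result or None instead of raising KeyError.
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {
        term: {"doc_freq": info.get("doc_freq"), "ttf": info.get("ttf")}
        for term, info in terms.items()
    }


# Hand-written sample response; the numbers are invented for illustration.
sample = {
    "term_vectors": {
        "your_field": {
            "terms": {
                "foo": {"doc_freq": 3, "ttf": 7, "term_freq": 1},
            }
        }
    }
}

stats = extract_term_stats(sample, "your_field")
print(stats)
```

Iterating over whatever terms come back, rather than looking up the input string, also shows how the analyzer tokenized the input.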
Radio Controlled