
I have exactly 100 documents already indexed in Elasticsearch, and I need to update each of them by adding a new field, using the following function:

    from elasticsearch import Elasticsearch

    def add_new_field():
        es_host = {"host": "localhost", "port": 9200}
        es = Elasticsearch(hosts=[es_host], timeout=180)
        for i in range(100):
            # partial update: merge the new field into each stored document
            es.update(
                index='history',
                doc_type='resources',
                id=i,
                body={"doc": {"square": i ** 2}}
            )

The problem is that after executing this function, the doc_freq of some terms is higher than the expected document frequency (note: I have set dfs=True).

E.g. 'term1' exists in all 100 documents, so its doc_freq should be 100; instead I get doc_freq=113.
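
For reference, a doc_freq like the one above can be read back per document via the termvectors API. A minimal sketch, assuming an ES 1.x/2.x-era Python client (the dfs option existed in those versions and was removed in 5.0) and a hypothetical indexed field named 'content':

    from elasticsearch import Elasticsearch

    es = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}], timeout=180)

    # Read back the distributed term statistics for one document.
    # 'content' is a hypothetical field name; dfs=True asks ES to gather
    # frequencies across all shards instead of just the local one.
    tv = es.termvectors(
        index='history',
        doc_type='resources',
        id=0,
        fields=['content'],
        term_statistics=True,
        dfs=True,
    )
    for term, stats in tv['term_vectors']['content']['terms'].items():
        print(term, stats['doc_freq'])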

K.Ali
  • The term and field statistics are not accurate: deleted documents are not taken into account. If you updated documents, then you have deleted documents, since an update marks the old version as deleted. The 113 can come from those deleted (or better said, "marked as deleted") documents. – Andrei Stefan Apr 06 '16 at 16:39
  • Is there any other way to update the documents that avoids this problem, or to exclude the old versions? – K.Ali Apr 06 '16 at 16:50
  • The "old", marked as deleted documents will be physically removed from the index when the segments they reside in are merged. Try an [optimize with only_expunge_deletes](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-optimize.html#optimize-parameters) and see if you get back the same document frequency. – Andrei Stefan Apr 06 '16 at 16:52

0 Answers