
I'm fairly new to Elasticsearch and trying to periodically delete documents using the _delete_by_query API (I fully appreciate I should probably be using time based indices to make this easier, and will be updating the indexing structure in due course, but for now I need to get this working).

My index contains fields called ServiceName, message and timestamp (among others) and my requirement is pretty simple. I want to delete documents where ServiceName equals a specific value (myService), the message does NOT equal either of two specific values (Starting* and Finished*, as I want to retain the first and last log message from any trace history), and the document is older than one day. I am using the _delete_by_query endpoint with the following payload:

{
    "query": {
        "bool": {
            "must": [],
            "filter": [{
                    "match_all": {}
                },
                {
                    "match_phrase": {
                        "ServiceName": {
                            "query": "myService"
                        }
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "lte": "now-1d"
                        }
                    }
                }
            ],
            "should": [],
            "must_not": [{
                "bool": {
                    "should": [{
                            "match_phrase": {
                                "message": "Starting*"
                            }
                        },
                        {
                            "match_phrase": {
                                "message": "Finished*"
                            }
                        }
                    ],
                    "minimum_should_match": 1
                }
            }]
        }
    }
}

When I run the query using the _search API, it returns the data I'd expect to be deleted, but when I issued the same query to _delete_by_query, it deleted documents that were not returned in the search results. I am using AWS Elasticsearch Service. Can anybody tell me where I'm going wrong, or should this work? I initially thought it might be the minimum_should_match property, but the documentation seems to suggest this is irrelevant in this case.

pr.lwd
  • `...it deleted documents that were returned in the search results` which is what you'd expect right? – Val Nov 13 '20 at 09:11
  • Ha - typo there. It deleted documents that did not get returned by the search results, is what I should have said. Will edit – pr.lwd Nov 13 '20 at 10:07
  • Ok, it's more logical that way ;-) I find it very surprising though... How many results do you get for the search query? and how many were actually deleted with the same query? – Val Nov 13 '20 at 10:08
  • Hmm maybe I genuinely did something wrong. I'll give it another go as I was surprised too. We are talking about 20 million documents as it's built up over time – pr.lwd Nov 13 '20 at 14:36
  • Thanks everyone for comments - I've reworked indexing strategy to use date based indices so purging old data is now a simple case of deleting the index – pr.lwd Dec 17 '20 at 14:52
  • Great, yes, that's how you're supposed to be handling time-based data. Much easier – Val Dec 17 '20 at 14:58

1 Answer

This is strange. Can you check whether the number of hits is the same in both cases? Search results in Kibana are truncated, which could explain why a particular document does not appear in the search results but still gets deleted.

If that is not the case, it would be great if you could share a sample of two documents:
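One way to compare the numbers without relying on what Kibana displays is to send the same query to the _count API and compare its "count" value with the "total" and "deleted" values in the _delete_by_query response. The following is only a rough Python sketch: the function names, the base URL and the index name are placeholders I made up, the request is unauthenticated (an AWS domain may require signed requests), and the must_not clause is written as a flat list, which should be equivalent to the nested bool/should with minimum_should_match 1 in the question.

```python
import json
import urllib.request


def build_purge_query(service_name, keep_phrases, older_than="now-1d"):
    """Build the bool query from the question as a plain dict (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match_phrase": {"ServiceName": {"query": service_name}}},
                    {"range": {"@timestamp": {"lte": older_than}}},
                ],
                # A document is excluded if it matches ANY of these clauses.
                "must_not": [
                    {"match_phrase": {"message": phrase}} for phrase in keep_phrases
                ],
            }
        }
    }


def count_matches(base_url, index, body):
    """POST the query to the _count API and return the number of matching documents.

    base_url and index are placeholders; no authentication is added here.
    """
    request = urllib.request.Request(
        f"{base_url}/{index}/_count",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["count"]


if __name__ == "__main__":
    body = build_purge_query("myService", ["Starting*", "Finished*"])
    print(json.dumps(body, indent=2))
    # Example call against a hypothetical endpoint and index:
    # print(count_matches("https://my-domain.example.com", "my-index", body))
```

If the count differs noticeably from the number of deleted documents, that would point at the query being interpreted differently; if they agree, the discrepancy is more likely in how the search results were viewed.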

  • Doc A: Gets listed in search and also gets deleted.
  • Doc B: Not listed in search but gets deleted.

This will help me replicate the issue on my end and get back to you.

Ankit Garg