6

I have millions of records in Elasticsearch. Today I realized that some of them are duplicated. Is there any way to remove these duplicate records?

This is my query.

{
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": [
                        { "match": { "sensorId": "14FA084408" } },
                        { "match": { "variableName": "FORWARD_FLOW" } }
                    ]
                }
            },
            "filter": {
                "range": { "timestamp": { "gt": "2015-07-04", "lt": "2015-07-06" } }
            }
        }
    }
}

And this is what I receive from it.

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 21,
        "max_score": 8.272615,
    "hits": [
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxVcMpd7AZtvmZcK",
            "_score": 8.272615,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxVnMpd7AZtvmZcL",
            "_score": 8.272615,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxV6Mpd7AZtvmZcN",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxWOMpd7AZtvmZcP",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxW8Mpd7AZtvmZcT",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxXFMpd7AZtvmZcU",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxXbMpd7AZtvmZcW",
            "_score": 8.0957,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxUtMpd7AZtvmZcG",
            "_score": 8.077545,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxXPMpd7AZtvmZcV",
            "_score": 8.077545,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        },
        {
            "_index": "iotsens-summarizedmeasures",
            "_type": "summarizedmeasure",
            "_id": "AU5isxUZMpd7AZtvmZcE",
            "_score": 7.9553676,
            "_source": {
                "id": null,
                "sensorId": "14FA084408",
                "variableName": "FORWARD_FLOW",
                "rawValue": "0.2",
                "value": "0.2",
                "timestamp": 1436047200000,
                "summaryTimeUnit": "DAYS"
            }
        }
    ]
    }
}

As you can see, I have 21 duplicated records for the same day. How can I delete the duplicated records and preserve only one per day? Thanks.

Tombart
Sapikelio
  • Manually deleting them. ES doesn't auto-magically remove "duplicates" (your own definition of a duplicate record). – Andrei Stefan Jul 07 '15 at 09:23
  • I know that I should delete them manually, but I want to know the best or most efficient way to do this. Perhaps I have not expressed my question well. Thank you anyway. – Sapikelio Jul 07 '15 at 09:45

3 Answers

4

Do a count (use the Count API for this), then use delete by query with the query size being one less than the count (use delete by query together with the from/size parameters to get this).

Count API

From/size API

Delete by query API

In this case you should write your query such that it gets only duplicate records.

Or just query for the IDs and invoke a bulk delete on all but one. But I guess you can't do this easily, since you don't have an ID of your own. IMHO, I don't see any other smart way to do this.
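
A rough sketch of those two building blocks against the index from the question (host, port and the exact request bodies are my assumptions; the date filter is omitted for brevity):

# 1) Count the matching documents (Count API)
curl -XGET 'http://localhost:9200/iotsens-summarizedmeasures/summarizedmeasure/_count' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "sensorId": "14FA084408" } },
        { "match": { "variableName": "FORWARD_FLOW" } }
      ]
    }
  }
}'

# 2) Fetch only the document IDs (no _source), so that all hits but one can be deleted
curl -XGET 'http://localhost:9200/iotsens-summarizedmeasures/summarizedmeasure/_search' -d '{
  "from": 0,
  "size": 21,
  "_source": false,
  "query": {
    "bool": {
      "must": [
        { "match": { "sensorId": "14FA084408" } },
        { "match": { "variableName": "FORWARD_FLOW" } }
      ]
    }
  }
}'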

Vamsi Krishna
  • Thanks, I'm going to try your recommendation. – Sapikelio Jul 07 '15 at 10:37
  • Finally, I solved my problem using a script, but I marked your answer as correct because it helped me find the right solution. Thank you. – Sapikelio Jul 30 '15 at 11:51
  • @Sapikelio Could you please post the script? I have this same problem, but with millions of records, and I'm trying to find the most scalable method. – Duplexia Oct 16 '15 at 03:46
4

Using aggregation queries you can find duplicated fields in your ES index:

e.g. find up to 3 values of the field Uuid that occur in at least two documents (and return at most 5 of the duplicated documents for each Uuid):

curl -XPOST http://localhost:9200/logstash-2017.03.17/_search -d '
 {
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "Uuid",
        "min_doc_count": 2,
        "size": 3
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}'

From the output you can easily extract the document _ids and delete them. With jq:

cat es_response.json | jq -r '.aggregations.duplicateCount.buckets[].duplicateDocuments.hits.hits[]._id'

A naive approach would then be to issue individual DELETE requests:

 curl -XDELETE http://localhost:9200/{index}/{document type}/{_id value}

However, this would delete ALL documents that were duplicated, without leaving a single unique copy in the index (usually; see below). Moreover, separate DELETE requests are extremely inefficient.
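
The Bulk API avoids that per-request overhead. As a rough sketch (the file names are hypothetical, and the `_type` is an assumption that has to match your mapping), the IDs you decided to drop can be turned into a single bulk delete request:

# turn a file of document IDs (one per line, only the copies you want to drop)
# into a single Bulk API delete request
awk '{ printf "{ \"delete\": { \"_index\": \"logstash-2017.03.17\", \"_type\": \"logs\", \"_id\": \"%s\" } }\n", $0 }' ids_to_delete.txt > bulk_delete.ndjson

curl -XPOST http://localhost:9200/_bulk --data-binary @bulk_delete.ndjson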

I wrote an es-deduplicator tool that keeps one document from each group of duplicates and deletes the rest via the Bulk API.

This way thousands of documents can be deleted in several minutes:

ES query took 0:01:44.922958, retrieved 10000 unique docs
Deleted 232539 duplicates, in total 1093490. Batch processed in 0:00:07.550461, running time 0:09:03.853110
ES query took 0:01:38.117346, retrieved 10000 unique docs
Deleted 219259 duplicates, in total 1312749. Batch processed in 0:00:07.351001, running time 0:10:50.322695
ES query took 0:01:40.111385, retrieved 10000 unique docs

NOTE: when deleting documents in a loop, it's essential to refresh the index after each bulk request; otherwise the next query might return documents that have already been deleted.
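
For example, a refresh between batches looks like this (same index as above):

curl -XPOST http://localhost:9200/logstash-2017.03.17/_refresh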

By design, terms aggregations are approximate, so it's quite likely that a few documents will be omitted (depending on how many shards and nodes you have). With multiple nodes (a typical cluster setup) it's better to query again by the unique field (and delete any extra copies).

Tombart
0

This is a rough idea and might not exactly fit your needs, but it was my first thought on reading your question.

How about reindexing the entire data set using any Elasticsearch client library? While doing so, compute a hash code for each of these objects (documents, I mean) and use it as the document's ID. Any documents that have exactly the same fields get reindexed to the same ID, so the duplication is gone once the reindexing is complete.
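
A minimal sketch of that idea for a single document, using shell tools instead of a client library (the target index name is hypothetical, and in practice you would scroll over the whole source index):

# hypothetical sketch: use a hash of the document body as the new _id,
# so documents with identical fields collapse into a single document
SOURCE='{"sensorId":"14FA084408","variableName":"FORWARD_FLOW","rawValue":"0.2","value":"0.2","timestamp":1436047200000,"summaryTimeUnit":"DAYS"}'
ID=$(printf '%s' "$SOURCE" | sha1sum | cut -d' ' -f1)
curl -XPUT "http://localhost:9200/iotsens-summarizedmeasures-dedup/summarizedmeasure/$ID" -d "$SOURCE"

Indexing the same body twice produces the same ID, so the second write simply overwrites the first instead of creating a duplicate.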

AvisekS
  • That's an idea I would suggest for a dev environment. As far as the OP is concerned, I guess the code is already running in a prod environment, if I'm not wrong (the OP mentioned having millions of records), so the main aim is to remove the existing duplicates first. Reindexing is the most painful solution as far as ES is concerned. – Vamsi Krishna Jul 07 '15 at 15:19
  • Thank you both for answering. Vamsi's right, all the data is in a prod environment and I can't delete and reindex it. I'll try to write a script with the Java API to fetch all records and delete the duplicates at runtime. – Sapikelio Jul 08 '15 at 06:09