Elasticsearch: Find duplicates by field

Question

I'm working with Elasticsearch. I have a collection of events with event names, for example FC Barcelona - Real Madrit; somewhere else in the collection there may be Footbal Club Barcela - FC Real Madryt.

I need to find groups of at least 2 matching hits without supplying any query text. I think an aggregation and the ngram tokenizer should be used here, but I'm not sure.

Here are my index settings:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "test": {
                    "tokenizer": "test",
                    "filter": ["lowercase", "word_delimiter", "nGram", "porter_stem"]
                }
            },
            "tokenizer": {
                "test": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 15,
                    "token_chars": [
                        "letter",
                        "digit",
                        "whitespace"
                    ]
                }
            }
        }
    }
}

And this is what my current query looks like:

{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

And here is my mapping:

{
    "event": {
        "properties": {
            "eventName": {
                "type": "keyword"
            }
        }
    }
}

Could you point me in the right direction, please?

Your mapping and what queries you've tried so far would help us to be able to answer your question. — Tim, Sep 26 '18 at 13:23

score 1 · Answer 1 · answered Sep 26 '18 at 14:34

You shouldn't need nGrams if you are looking for exact duplicates. Keep the keyword type you already have on eventName, and use the terms aggregation as in your current query:

POST <index_name>/event/_search
{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

The duplicated eventName values will be listed as the bucket keys of the duplicateNames aggregation. The _id of each matching document will be in the duplicateDocuments top hits of each bucket.
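To make the shape of that response concrete, here is a minimal Python sketch that walks the buckets and collects the _id values per duplicated name. The helper name extract_duplicates and the sample response are illustrative assumptions (the sample is hand-written in the usual terms + top_hits layout, not real output from a cluster).

```python
from typing import Dict, List


def extract_duplicates(response: dict,
                       agg_name: str = "duplicateNames",
                       hits_name: str = "duplicateDocuments") -> Dict[str, List[str]]:
    """Map each duplicated eventName to the _ids returned by its top_hits."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    duplicates: Dict[str, List[str]] = {}
    for bucket in buckets:
        # Each bucket key is one eventName that occurs min_doc_count times or more.
        hits = bucket.get(hits_name, {}).get("hits", {}).get("hits", [])
        duplicates[bucket["key"]] = [hit["_id"] for hit in hits]
    return duplicates


# Hand-written sample in the shape of a terms + top_hits response (assumed data).
sample_response = {
    "aggregations": {
        "duplicateNames": {
            "buckets": [
                {
                    "key": "FC Barcelona - Real Madrit",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}
                    },
                }
            ]
        }
    }
}

duplicates = extract_duplicates(sample_response)
```

Keep in mind that top_hits returns only 3 hits per bucket by default and the terms aggregation returns only 10 buckets by default, so you may need to set "size" on both to see every duplicate.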

answered Sep 26 '18 at 14:34

Tim


Is it possible to filter out items by date? Let's say each event has an eventStart and I want to display only one of those that start on the same day. – shareone2 Sep 27 '18 at 09:53
You probably want to look into a histogram aggregation. Something like [this answer](https://stackoverflow.com/a/47938636/229778). Feel free to ask another question on SO if you're running into issues with it. – Tim Sep 27 '18 at 12:41
Well, this does not work, because it only finds exact 1:1 matches, like "FC Barcelona - Real Madrit"; it won't return anything if a document has a name like "Football Club Barcelona - Real Madryt". – shareone2 Oct 03 '18 at 09:34
@shareone2 - That is a different use case and technically not a duplicate as they are not an exact match. You may want to look into a [More Like This](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html) query instead. – Tim Oct 03 '18 at 15:05
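As a rough illustration of the More Like This suggestion in the last comment, the Python sketch below builds a request body for one event name. It assumes an analyzed text sub-field called eventName.text, which is hypothetical and not part of the mapping in the question (a keyword field is indexed as a single term, so it cannot match on individual words). The helper name build_mlt_query and the chosen thresholds are also assumptions to tune, not recommended values.

```python
import json


def build_mlt_query(event_name: str,
                    field: str = "eventName.text",  # hypothetical text sub-field
                    size: int = 5) -> dict:
    """Build a more_like_this search body for one event name."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": event_name,
                # Event names are short, so lower the term/doc frequency
                # thresholds from their defaults (2 and 5) to 1.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Assumed value: require half of the selected terms to match.
                "minimum_should_match": "50%",
            }
        },
    }


body = build_mlt_query("FC Barcelona - Real Madrit")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST <index_name>/_search. Since More Like This works on whole terms, misspellings such as "Madrit" vs "Madryt" would still depend on how that text field is analyzed.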
