
I'm working with Elasticsearch. I have a collection of events, each with an event name, e.g. FC Barcelona - Real Madrit, and somewhere else in the collection there may be Footbal Club Barcela - FC Real Madryt.

I need to find groups of at least 2 matching documents without supplying any query text. I think an aggregation and an ngram tokenizer should be used here, but I'm not sure.

Here are my index settings:

{
        "settings": {
            "analysis": {
                "analyzer": {
                    "test": {
                        "tokenizer": "test",
                        "filter": ["lowercase", "word_delimiter", "nGram", "porter_stem"]
                    }
                },
                "tokenizer": {
                    "test": {
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 15,
                        "token_chars": [
                            "letter",
                            "digit",
                            "whitespace"
                        ]
                    }
                }
            }
        }
    }

And this is what my current query looks like:

{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

And here is my mapping:

{
            "event": {
                "properties": {
                    "eventName": {
                        "type": "keyword"
                    }
                }
            }
        }

Could you point me in the right direction, please?

shareone2

1 Answer


You shouldn't need the nGrams if you are looking for duplicates. You'll want to use the keyword type like you have. You can use the terms aggregation like you already have.

POST <index_name>/event/_search
{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

Each duplicated eventName will be listed as a bucket of the duplicateNames aggregation. The _id of each matching document will be in that bucket's duplicateDocuments top hits.
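To make the response shape concrete, here is a minimal Python sketch of pulling the duplicate names and document ids out of that aggregation result. The function name `extract_duplicates` and the sample response are hypothetical and hand-written for illustration; only the nesting (`aggregations` → `duplicateNames` → `buckets` → `duplicateDocuments` → `hits`) follows the query above.

```python
def extract_duplicates(response):
    """Map each duplicated eventName to the _ids found in its top hits."""
    duplicates = {}
    buckets = response["aggregations"]["duplicateNames"]["buckets"]
    for bucket in buckets:
        hits = bucket["duplicateDocuments"]["hits"]["hits"]
        duplicates[bucket["key"]] = [hit["_id"] for hit in hits]
    return duplicates


# Hypothetical response fragment, trimmed to the fields the function reads.
sample_response = {
    "aggregations": {
        "duplicateNames": {
            "buckets": [
                {
                    "key": "FC Barcelona - Real Madrit",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}
                    },
                }
            ]
        }
    }
}

duplicates = extract_duplicates(sample_response)
print(duplicates)
```

Keep in mind that top_hits returns 3 documents per bucket by default and the terms aggregation returns 10 buckets by default, so you may need to raise the "size" of each to see every duplicate.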

Tim
  • Is it possible to filter items by date? Let's say each event has an eventStart and I want to display only one of the events that start on the same day. – shareone2 Sep 27 '18 at 09:53
  • You probably want to look into a histogram aggregation. Something like [this answer](https://stackoverflow.com/a/47938636/229778). Feel free to ask another question on SO if you're running into issues with it. – Tim Sep 27 '18 at 12:41
  • Well, this does not work, because it only finds exact 1:1 matches. A search for "FC Barcelona - Real Madrit" won't return anything if the other document has a name like "Football Club Barcelona - Real Madryt". – shareone2 Oct 03 '18 at 09:34
  • @shareone2 - That is a different use case and technically not a duplicate as they are not an exact match. You may want to look into a [More Like This](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html) query instead. – Tim Oct 03 '18 at 15:05
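
As a rough sketch of what such a More Like This request body could look like, the Python below builds one from an event name. The helper `build_mlt_query` and the field name `eventName.text` are assumptions, not part of the original mapping: the idea is that an analyzed text sub-field would be needed, since the keyword field only holds the whole name as a single term. The threshold values are illustrative starting points, not tuned recommendations.

```python
import json


def build_mlt_query(event_name, field="eventName.text"):
    """Build a more_like_this search body for one event name.

    `field` is a hypothetical analyzed text sub-field of eventName.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": event_name,
                # Event names are short, so count terms that occur only once.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "minimum_should_match": "50%",
            }
        }
    }


body = build_mlt_query("FC Barcelona - Real Madrit")
print(json.dumps(body, indent=2))
```

Unlike the terms aggregation, this needs one request per event name, so finding all near-duplicates means looping over the documents.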