I am using Elasticsearch 7.8, and my index contains entries like the following:

{"_id" : 1,"sourceip":"1.1.1.1", "data" : "this is a sample input", "processedflag" : true}
{"_id" : 2,"sourceip":"1.1.1.1", "data" : "this is a sample input", "processedflag" : false}
{"_id" : 3,"sourceip":"1.1.1.1", "data" : "this is an another input", "processedflag" : false}
{"_id" : 4,"sourceip":"1.1.1.2", "data" : "this is a sample input", "processedflag" : false}

For sourceip 1.1.1.1, I want to aggregate on "data" and find the duplicates.
In the example above, I want to get the _id values 1 and 2, because those two entries have the same data.

Thanks,
Harry

Harry

1 Answer

Looking at your data, I've considered only the first three fields (id, sourceip and data) and created the mapping, documents, query and response based on them.

Mapping (notice that data is mapped as keyword, so that the terms aggregation buckets on the exact, whole string):

PUT my_ip_index
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "sourceip": {
        "type": "ip"
      },
      "data": {
        "type": "keyword"
      }
    }
  }
}

Sample Documents:

POST my_ip_index/_doc/1
{
  "id": 1,
  "sourceip": "1.1.1.1",
  "data": "this is a sample input"
}

POST my_ip_index/_doc/2
{
  "id": 2,
  "sourceip": "1.1.1.1",
  "data": "this is a sample input"
}

POST my_ip_index/_doc/3
{
  "id": 3,
  "sourceip": "1.1.1.1",
  "data": "this is an another input"
}

POST my_ip_index/_doc/4
{
  "id": 4,
  "sourceip": "1.1.1.2",
  "data": "this is a sample input"
}

POST my_ip_index/_doc/5
{
  "id": 5,
  "sourceip": "1.1.1.2",
  "data": "this is a sample another input"
}

Only the first two documents are duplicates of each other, i.e. they have the same IP as well as the same data.
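To make the notion of a duplicate concrete, here is a minimal Python sketch (not part of the original answer) that groups the five sample documents in memory by (sourceip, data) and keeps only the groups with at least two members. The helper name find_duplicates is hypothetical; it only mirrors what the aggregation request is meant to compute and does not talk to Elasticsearch.

```python
from collections import defaultdict

# The five sample documents indexed above.
DOCS = [
    {"id": 1, "sourceip": "1.1.1.1", "data": "this is a sample input"},
    {"id": 2, "sourceip": "1.1.1.1", "data": "this is a sample input"},
    {"id": 3, "sourceip": "1.1.1.1", "data": "this is an another input"},
    {"id": 4, "sourceip": "1.1.1.2", "data": "this is a sample input"},
    {"id": 5, "sourceip": "1.1.1.2", "data": "this is a sample another input"},
]


def find_duplicates(docs, min_count=2):
    """Group ids by (sourceip, data) and keep groups with >= min_count ids."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["sourceip"], doc["data"])].append(doc["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_count}


print(find_duplicates(DOCS))
# {('1.1.1.1', 'this is a sample input'): [1, 2]}
```

The min_count argument plays the same role here as min_doc_count does in a terms aggregation.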

Aggregation Request (note the min_doc_count of 2 on both the sourceip and the data terms aggregations, so only values that occur at least twice produce a bucket):

POST my_ip_index/_search
{
  "size": 0,
  "aggs": {
    "my_ip_address": {
      "terms": {
        "field": "sourceip",
        "min_doc_count": 2
      },
      "aggs": {
        "my_data": {
          "terms": {
            "field": "data",
            "min_doc_count": 2
          },
          "aggs": {
            "my_duplicate_ids": {
              "terms": {
                "field": "id",
                "size": 10
              }
            }
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "my_data._bucket_count"
            },
            "script": {
              "source": "params.count > 0"
            }
          }
        }
      }
    }
  }
}

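Note that I've made use of two aggregations, the terms aggregation (nested three levels deep: IP, then data, then id) and the bucket selector aggregation; notice the structure in particular.

Also notice how I've used the special _bucket_count path in the bucket selector aggregation: it drops every IP bucket that is left without any duplicate data values.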
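The question asks about one specific source IP, whereas the request above aggregates over every IP in the index. As a hedged sketch (my addition, not part of the original answer), the same request body can be built in Python with an optional term query on sourceip to narrow it down. The function name build_duplicate_query and its parameters are hypothetical, and the code only builds the dictionary; sending it to the cluster is left to whichever client you use.

```python
import json


def build_duplicate_query(source_ip=None, max_ids=10):
    """Build the duplicate-finding request body as a plain dict.

    source_ip: optional IP to restrict the search to (assumption: a term
               query on the ip-typed sourceip field is what you want).
    max_ids:   cap on the number of ids returned per duplicate group.
    """
    body = {
        "size": 0,  # no hits needed, only aggregations
        "aggs": {
            "my_ip_address": {
                "terms": {"field": "sourceip", "min_doc_count": 2},
                "aggs": {
                    "my_data": {
                        "terms": {"field": "data", "min_doc_count": 2},
                        "aggs": {
                            "my_duplicate_ids": {
                                "terms": {"field": "id", "size": max_ids}
                            }
                        },
                    },
                    # Drop IP buckets that have no duplicate data values.
                    "min_bucket_selector": {
                        "bucket_selector": {
                            "buckets_path": {"count": "my_data._bucket_count"},
                            "script": {"source": "params.count > 0"},
                        }
                    },
                },
            }
        },
    }
    if source_ip is not None:
        body["query"] = {"term": {"sourceip": source_ip}}
    return body


print(json.dumps(build_duplicate_query("1.1.1.1"), indent=2))
```

The printed JSON can be pasted as the body of POST my_ip_index/_search. Keep in mind that max_ids limits how many ids come back per group, so a group with more duplicates than that will be truncated.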

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_ip_address" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "1.1.1.1",                          <---- IP
          "doc_count" : 3,
          "my_data" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "this is a sample input",     <---- data
                "doc_count" : 2,
                "my_duplicate_ids" : {
                  "doc_count_error_upper_bound" : 0,
                  "sum_other_doc_count" : 0,
                  "buckets" : [
                    {
                      "key" : "1",                    <---- id you are looking for
                      "doc_count" : 1
                    },
                    {
                      "key" : "2",                    <---- id you are looking for
                      "doc_count" : 1
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
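To pull the ids out of a response shaped like the one above, a small helper along these lines might do. It is only a sketch: extract_duplicate_ids is a hypothetical name, it assumes the aggregation names used in this answer (my_ip_address, my_data, my_duplicate_ids), and SAMPLE_RESPONSE is a trimmed copy of the response shown above.

```python
def extract_duplicate_ids(response):
    """Return a list of (ip, data, [ids]) tuples from the search response."""
    results = []
    ip_buckets = response["aggregations"]["my_ip_address"]["buckets"]
    for ip_bucket in ip_buckets:
        for data_bucket in ip_bucket["my_data"]["buckets"]:
            ids = [b["key"] for b in data_bucket["my_duplicate_ids"]["buckets"]]
            results.append((ip_bucket["key"], data_bucket["key"], ids))
    return results


# Trimmed copy of the response above (only the fields the helper reads).
SAMPLE_RESPONSE = {
    "aggregations": {
        "my_ip_address": {
            "buckets": [
                {
                    "key": "1.1.1.1",
                    "doc_count": 3,
                    "my_data": {
                        "buckets": [
                            {
                                "key": "this is a sample input",
                                "doc_count": 2,
                                "my_duplicate_ids": {
                                    "buckets": [
                                        {"key": "1", "doc_count": 1},
                                        {"key": "2", "doc_count": 1},
                                    ]
                                },
                            }
                        ]
                    },
                }
            ]
        }
    }
}

print(extract_duplicate_ids(SAMPLE_RESPONSE))
# [('1.1.1.1', 'this is a sample input', ['1', '2'])]
```

The ids come back as strings because the id field is mapped as keyword.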

Hope that helps!

Kamal Kunjapur
  • Thanks @Opster ES Ninja - Kamal, I am facing another serious issue : https://stackoverflow.com/questions/62783397/aggregation-in-elasticsearch-across-indices-is-not-working Could you help me on this? – Harry Jul 07 '20 at 20:29
  • @Harry, sure will look into it and update you. Also I'm sorry I couldn't find a solution related to the user authorisation query you've posted a while back but I'll do some more research on that as well and let you know. – Kamal Kunjapur Jul 07 '20 at 20:32
  • No issues, I researched and found the roles for it. I can share it in the same question. But this seems to be a hiccup for me : https://stackoverflow.com/questions/62783397/aggregation-in-elasticsearch-across-indices-is-not-working – Harry Jul 07 '20 at 20:36
  • @Harry, sure, feel free to go ahead and post your answer and accept it as well. I will see what I can do with the latest question you've mentioned. – Kamal Kunjapur Jul 07 '20 at 21:16
  • Sure Thanks, Please help me on this question – Harry Jul 08 '20 at 03:16
  • Please look into this question : https://stackoverflow.com/questions/62843709/uncategorizedexecutionexceptionfailed-execution-nested-ioexceptionconnectio – Harry Jul 13 '20 at 14:27