Elasticsearch - how to do field collapsing and get Distinct results? (actual records, not just counters)

Question

In relational db our data looks like this: Company -> Department -> Office

Elasticsearch version of the same data (flattened):

{
     "officeID": 123,
     "officeName": "office 1",
     "state": "CA",
     "department": {
          "departmentID": 456,
          "departmentName": "Department 1",
          "company": {
                "companyID": 789,
                "companyName": "Company 1",
          }
     } 
},{
     "officeID": 124,
     "officeName": "office 2",
     "state": "CA",
     "department": {
          "departmentID": 456,
          "departmentName": "Department 1",
          "company": {
              "companyID": 789,
              "companyName": "Company 1",
          }
      }}

We need to find department (or company) by providing office information (such as state).

For example, since all I need is a department info, I can specify it like this (we are using Nest)

searchDescriptor = searchDescriptor.Source(x => x.Include("department"));

and get all departments with qualifying offices.

The problem is - I am getting multiple "department" records with the same id (one for each office).

We are using paging and sorting.

Would it be possible to get paged and sorted Distinct results?

I have spent a few days trying to find an answer (exploring options like facets, aggregations, top_hits etc) but so far the only working option I see would be a manual one - get results from Elasticsearch, group data manually and pass to the client. The problem with this approach is obvious - every time I grab next portion, I'll have to get X extra records just in case some of the records will be duplicate; since I don't know X in advance (and number of such records could be huge) will be forced either to get lots of data unnecessarily (every time I do the search) or to hit our search engine several times until I get required number of records.

So far I was unable to achieve my goal using aggregations (all I am getting is document count, but I want actual data; when I try to use top_hits, I am getting data, but those are really top hits (sorted by number of offices per department, ignoring sorting I have specified in the query); here is an example of the code I tried:

            searchDescriptor = searchDescriptor.Aggregations(a => a
            .Terms("myunique",
                t =>
                    t.Field("department.departmentID")
                    .Size(10)
                    .Aggregations(
                        x=>x.TopHits("mytophits", 
                            y=>y.Source(true)
                                .Size(1)
                                .Sort(k => k.OnField("department.departmentName").Ascending())
                                )
                            )
                        )
            );

Does anyone know if Elasticsearch can perform operations like Distinct and get unique records?

Update: I can get results using top_hits (see below), but in this case I won't be able to use paging (looks like Elasticsearch aggregations feature doesn't support paging), so I am back to square one...

{
  "from": 0,
  "size": 33,
  "explain": false,
  "sort": [
    {
      "departmentID": {
        "order": "asc"
      }
    }
  ],
  "_source": {
    "include": [
      "department"
    ]
  },
  "aggs": {
    "myunique": {
      "terms": {
        "field": "department.departmentID",
        "order": {
          "mytopscore": "desc"
        }
      },
      "aggs": {
        "mytophits": {
          "top_hits": {
            "size": 5,
            "_source": {
              "include": [
                "department.departmentID"
              ]
            }
          }
        },
        "mytopscore": {
          "max": {
            "script": "_score"
          }
        }
      }
    }
  },
  "query": {
        "wildcard" : { "officeName" : "some office*" }
  } 
}

no you can't without aggregation, with your mapping. But the problem may be your mapping so. You could use parent-child relationship, then has-child query (https://www.elastic.co/guide/en/elasticsearch/guide/current/has-child.html) — Julien C., Jun 01 '15 at 20:52
@juliendangers - I am trying to avoid parent-child schema (as far as I heard, it takes lots of memory while doing joins) — Steve V, Jun 01 '15 at 22:07
yes, you're right about memory consumption (see https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-performance.html). However you can emulate parent-child, without using as much RAM, but it's all about tradeoff — Julien C., Jun 01 '15 at 22:43
@juliendangers - how can I emulate parent-child? I don't see any obvious way, except one I am using already (include copy of the parent into each child object) — Steve V, Jun 01 '15 at 23:12
have a look at that part of the doc : https://www.elastic.co/guide/en/elasticsearch/guide/current/relations.html. As I said, it's all about tradeoff. You can use less memory, but you'll need to run multiple queries. It depends of what you're ready to give up. Have you tried with parent-child relationship though, to see if you really run into memory issue ? — Julien C., Jun 02 '15 at 06:22

Elasticsearch - how to do field collapsing and get Distinct results? (actual records, not just counters)

0 Answers0