Composite and Terms Aggregations on a field with a high cardinality

Question

I am facing a huge performance problem with ES which results in more than 2 min response.

I have an index that has more than 25M files and composes of the next 4 fields (among others):

...

       "group_write": {
            "type": "text",
            "fields": {
                "raw": {
                    "type": "keyword"
                }
            }
        },
        "user_write": {
            "type": "text",
            "fields": {
                "raw": {
                    "type": "keyword"
                }
            }
        },
        "group_read": {
            "type": "text",
            "fields": {
                "raw": {
                    "type": "keyword"
                }
            }
        },
        "user_read": {
            "type": "text",
            "fields": {
                "raw": {
                    "type": "keyword"
                }
            }
        }
    }

...

I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:

"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...

I have 2 kinds of aggregation I am using, composite and terms. Composite aggregations for getting only first X results to display and terms aggregation for prefix search.

Composite aggregation:

{
    "size": 0,
    "aggs": {
        "Group_Read_Permissions": {
            "composite": {
                "sources": [
                    {
                        "Group Read": {
                            "terms": {
                                "field": "group_read.raw"
                            }
                        }
                    }
                ],
                "size": 10
            }
        },
        "Group_Write_Permissions": {
            "composite": {
                "sources": [
                    {
                        "Group Write": {
                            "terms": {
                                "field": "group_write.raw"
                            }
                        }
                    }
                ]
            }
        },
        "User_Write_Permissions": {
            "composite": {
                "sources": [
                    {
                        "User Write": {
                            "terms": {
                                "field": "user_write.raw"
                            }
                        }
                    }
                ]
            }
        },
        "User_Read_Permissions": {
            "composite": {
                "sources": [
                    {
                        "User Read": {
                            "terms": {
                                "field": "user_read.raw"
                            }
                        }
                    }
                ]
            }
        }
    }
}

Terms aggregation:

{
    "size": 0,
    "aggs": {
        "Group_Read_Permissions": {
            "terms": {
                "field": "group_read.raw",
                "include": ".*[Ss].*"
            }
        },
        "Group Write Permissions": {
            "terms": {
                "field": "group_write.raw",
                "include": ".*[Ss].*"
            }
        },
        "User Read Permissions": {
            "terms": {
                "field": "user_read.raw",
                "include": ".*[Ss].*"
            }
        },
        "User Write Permissions": {
            "terms": {
                "field": "user_write.raw",
                "include": ".*[Ss].*"
            }
        }
    }
}

Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.

What I have tried so far:

Adding new field user_group_permissions and adding to the above 4 fields "copy_to": "user_group_permissions"
Adding to the above 4 fields and to the field "user_group_permissions" the next property: "eager_global_ordinals": true
Increased the refresh_interval up to 200s

** I reindexed for the first 2 suggestions [took something like 6 hours]

All of the above did help a little with the retrieval time but still: composite aggregation takes up to 20s and terms aggregation takes up to 3 min. [The best results were on the fields user_group_permissions which has been created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s].

Please, if someone has any idea how to improve the retrieval times I will be grateful.

score 0 · Answer 1 · answered Aug 26 '22 at 03:43

0

First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with default size 10, that'll do the job.

Second, what you're doing with the terms aggregation is not a prefix filtering, but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan" because each and every term must be visited.

A first optimization I would suggest is that in your second query you should do your regex in the query part (bool/should with one regex query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.

A second optimization is to leverage the wildcard field type which is a specialized field type made specially for grep-like wildcard and regexp queries.

Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.

Depending on your comments, I'll add more optimizations as the discussion goes on.

answered Aug 26 '22 at 03:43

Val

207,596
13
358
360

First, thanks for the reply! The composite aggregation was used since it is not sorting the results as opposed to the terms aggregation, therefore making it a faster one. Regarding the prefix filtering you are absolutely right, the intention was prefix and not infix. I did fix the query but still faced a long retrieving time(220s). Some conclusions about the suggestions: 1. A query of regex for each field indeed helped a little but not significantly. 2. wildcard and lowercase have not optimized the aggregation, wildcard even made it slower a little. Looking forward to your reply. – Daniel D Aug 26 '22 at 10:59
Ok, you can still use the `terms` aggregation and sort by key instead of sorting by count. Thanks for testing the suggestions. Concerning 1, yes that was expected as `regexp` queries are resource hogs. I'm surprised the wildcard field type doesn't help, though. Note that I'm not talking about the `wildcard` query, which is something different. – Val Aug 26 '22 at 11:22
Yes, I have reindexed after changing the field type to wildcard and it was slower. Do you think that maintaining a list of values inside those fields is the right call? I think that nested might be slower but not for sure, what do you think? – Daniel D Aug 29 '22 at 09:08
What kills your performance is the fact that you're searching within the string (i.e. infix). Another optimization that could help here (if used properly) is to use an [ngram tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html) (+ lowercase), it will make your index a tad bigger, but it could solve the infix search problem and might even get rid of the need for regexp. – Val Sep 02 '22 at 05:50

Composite and Terms Aggregations on a field with a high cardinality

1 Answers1