Do high cardinality fields affect performance for searches?

Question

A high cardinality field consists of a facetable or filterable field that has a significant number of unique values, and as a result, consumes significant resources when computing results

But it's not clear on whether this poor performance is limited to when the fields are specifically used in a filter/facet query, or whether it also affects performance when the field is queried against using search terms.

Can anyone with some deeper Azure Search knowledge weigh in?

score 2 · Accepted Answer · answered Jul 31 '19 at 01:31

After getting clarification from Microsoft, I can confirm that the answer is "no, performance is only affected when using the field in a facet/filter".

This poor performance is limited to when the fields are specifically used in a filter/facet query. The searchable terms will not be affected.

Fields that work best in faceted navigation have low cardinality: a small number of distinct values that repeat throughout documents in your search corpus (for example, a list of colors, countries/regions, or brand names). If the field that has a significant number of unique values, it will consume significant resources when computing the facet navigation. Because each distinct value will be 1 facet and need to be calculated.

At query time, a filter parser accepts criteria as input, converts the expression into atomic Boolean expressions represented as a tree, and then evaluates the filter tree over filterable fields in an index. If the field that has a significant number of unique values, the tree will be deep and consume significant computing resources. Because each unique value will be calculated in filter, there will be no cached result for duplicate items to reduce the calculation.

The searchable fields will not be affected if the fields have a significant number of unique values. Because searchable fields have inverted index to accelerate query. When you load the index, each field's inverted index is populated with all of the unique, tokenized words from each document, with a map to corresponding document IDs. For example, when indexing a hotels data set, an inverted index created for a City field might contain terms for Seattle, Portland, and so forth. Documents that include Seattle or Portland in the City field would have their document ID listed alongside the term.

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

I reached out to MS as well, this is the answer that I got:

“High cardinality” means different things to filterable vs searchable fields. Cardinality for filterable fields amounts to the uniqueness of the full value of the field. For searchable fields, it’s about the aggregate number of indexed terms that results from writing a document to the index. Complex custom analyzers, for example, can bloat the index by producing several tokens for each word in a string. Inverted indexes scale really well, so I wouldn’t be too concerned about having a high number of unique words in the index. But, this should help understand the unit of scale each.

This mention in the documentation is primarily to raise awareness about what contributes to query performance and why they may see reduced performance as they add additional fields to the filter clause. I will add…You can improve the performance of individual queries by scaling up the number of partitions in your service. Going from 1 to 2 not only doubles the storage available to your service, it also doubles the amount of compute power available to execute queries. The data workload is divided roughly equally between each partition. It doesn’t usually equate to exactly twice the performance for your queries, but it can have a significant impact if you are seeing slow queries.

Do high cardinality fields affect performance for searches?

2 Answers2