Can I optimize a SELECT DISTINCT x FROM hugeTable query by creating an index on column x?

Question

I have a huge table, having a much smaller number (by orders of magnitude) of distinct values on some column x.

I need to do a query like SELECT DISTINCT x FROM hugeTable, and I want to do this relatively fast.

I did something like CREATE INDEX hugeTable_by_x ON hugeTable(x), but for some reason, even though the output is small, the query execution is not as fast. The query plan shows that 97% of the time is spent on Index Scan of hugeTable_by_x, with an estimated number of rows equal to the size of the entire table. This is followed by, among other things, a Hash Match operation.

Since I created an index on column x, can I not expect this query to run very quickly?

Note that I'm using Microsoft SQL Server 2005.

Hi There, Is the Column you are Indexing a 'int' field? The idea of an idex on a table is for the system to map out where that index sits in the model, and then make it easier for it to retrieve. If this field has no relevance apart from it just being a value, it really wont make much difference, cause it still needs to scan the table. — Robbie Tapping, May 12 '11 at 06:03
If, say, there are 1000 rows in `hugeTable` with `x=1`, then `hugeTable_by_x` still has to contain 1000 references to those rows in its leaf level for `x=1`. And if those references are wide (what's the clustering key for `hugeTable`?), the index is going to be pretty large itself. — Damien_The_Unbeliever, May 12 '11 at 06:37
Note that I also tried `SELECT x FROM hugeTable GROUP BY x`, and it gives exactly the same query plan. — polygenelubricants, May 12 '11 at 06:50

score 28 · Answer 1 · edited Oct 24 '15 at 08:54

28

This is likely not a problem of indexing, but one of data design. Normalization, to be precise. The fact that you need to query distinct values of a field, and even willing to add an index, is a strong indicator that the field should be normalized into a separate table with a (small) join key. Then the distinct values will be available immediately by scanning the much smaller lookup foreign table.

Update
As a workaround, you can create an indexed view on an aggregate by the 'distinct' field. COUNT_BIG is an aggregate that is allowed in indexed views:

create view vwDistinct
with schemabinding
as select x, count_big(*)
from schema.hugetable
group by x;

create clustered index cdxDistinct on vwDistinct(x);

select x from vwDistinct with (noexpand);

edited Oct 24 '15 at 08:54

Alex Bitek

6,529
5
47
77

answered May 12 '11 at 06:12

Remus Rusanu

288,378
40
442
569

1

While what you say is definitely true, the case you are speaking about may be exactly WHY a `SELECT DISTINCT` is being done, as part of the normalization process. For example, we have a system which brings in a data feed from a set of downloaded FTP files. This data is NOT normalized at all. That is one half of the precise function of our process is to normalize the data as we load it into our system. So for example, we have a query (pseudocode) like `INSERT INTO NORMALIZEDVALUELIST (NAME) SELECT DISTINCT SOMEFIELD FROM UNNORMALIZEDSOURCE WHERE `. – eidylon Mar 06 '14 at 16:21
Concerning the workaround, wouldn't the cpu cost over time of maintaining this indexed view through CRUD operations be higher than that of a simple index on the original table? Then some of other workarounds posted may better choices. – crokusek Dec 24 '16 at 02:04
The indexed view does add a cost to write operations. I think it might actually be cheaper than an index on the column, if there are only a few distinct values and the table is large. If nothing else the disk space taken by the indexed view is then much less than that for the index. So the answer is "it depends". – Ed Avis Nov 06 '17 at 09:57
Hi, the answer is good but the example SQL will fail to run as is. When creating a view you need to name all the columns. Currently the `count_big(*)` column does not have a name so will give an error. I suggest something like `count_big(*) as kount` just to give it a name and let the view definition compile. – Ed Avis Oct 01 '20 at 12:49

score 9 · Answer 2 · edited Sep 18 '19 at 19:51

SQL Server does not implement any facility to seek directly to the next distinct value in an index skipping duplicates along the way.

If you have many duplicates then you may be able to use a recursive CTE to simulate this. The technique comes from here. ("Super-fast DISTINCT using a recursive CTE"). For example:

with recursivecte as (
  select min(t.x) as x
  from hugetable t
  union all
  select ranked.x
  from (
    select t.x,
           row_number() over (order by t.x) as rnk
    from hugetable t
    join recursivecte r
      on r.x < t.x
  ) ranked
  where ranked.rnk = 1
)
select *
from recursivecte
option (maxrecursion 0)

Denis de Bernardy · Answer 3 · 2011-05-12T10:22:04.830

If you know the values in advance and there is an index on column x (or if each value is likely to appear quickly on a seq scan of the whole table), it is much faster to query each one individually:

select vals.x
from [values] as vals (x)
where exists (select 1 from bigtable where bigtable.x = vals.x);

Proceeding using exists() will do as many index lookups as there are valid values.

The way you've written it (which is correct if the values are not known in advance), the query engine will need to read the whole table and hash aggregate the mess to extract the values. (Which makes the index useless.)

score 1 · Answer 4 · edited May 23 '17 at 12:02

No. But there are some workarounds (excluding normalization):

Once the index is in place, then its possible to implement in SQL what the optimizer could be doing automatically:

https://stackoverflow.com/a/29286754/538763 (multiple workarounds cited)

Other answers say you can normalize which would solve your issue but even once its normalized SQL Server still likes to perform a scan to find the max() within group(s). Workarounds:

https://dba.stackexchange.com/questions/48848/efficiently-query-max-over-multiple-ranges?rq=1

score 1 · Answer 5 · answered May 12 '11 at 06:04

1

When doing a SELECT DISTINCT on an indexed field, an index scan makes sense, as execution still has to scan each value in the index for the entire table (assuming no WHERE clause, as seems to be the case by your example).

Indexes usually have more of an impact on WHERE conditions, JOINS, and ORDER BY clauses.

answered May 12 '11 at 06:04

Jerad Rose

15,235
18
82
153

It doesn't really "have to scan each value in the index". Internally it could perform a series of seeks (e.g. binary searches) to find successive changes in value. – crokusek Dec 24 '16 at 01:11

score 0 · Answer 6 · answered Nov 28 '19 at 12:22

0

If your column x has low cardinality, creating local bitmap index would increase the performance many fold.

answered Nov 28 '19 at 12:22

Srinivasarao Kotipatruni

187
1
2
9

score 0 · Answer 7 · answered May 12 '11 at 06:04

As per your description of the execution plan, I would believe it's the best possible execution.

The Index Scan reads the entire index as stored (not in index order), the HASH MATCH does the distinct.

There might be other ways around your problem. In SQL Server, Indexed Views come to my mind. However, that might give you a big hit for write's on that table.

score -1 · Answer 8 · answered May 12 '11 at 05:58

-1

Possibly. Though it is not guaranteed - it entirely depends on the query.

I suggest reading this article by Gail Shaw (part 1 and part 2).

answered May 12 '11 at 05:58

Oded

489,969
99
883
1,009

Can I optimize a SELECT DISTINCT x FROM hugeTable query by creating an index on column x?

8 Answers8

Linked