Select Distinct/Unique Rows from a Huge Table in SQL Server Without Timeout Errors

Question

I have a table in SQL Server with around 400,000 rows. Unfortunately, this table has many duplicate rows, as there isn't a primary key. I'm using SELECT DISTINCT to get all unique rows, but as there's so many of them, I keep getting timeout errors. What's the best way to approach this problem?

You wont get a proper answer unless you post your table definition (script it out in SSMS with right-click script table as... CREATE), any indexes defined against the table and your query. Best guess if that you should create a clustered index on the column your distinct is on. — Nick Kavadias, Apr 06 '10 at 02:59
@Nick I'll do that, but, fyi, my distinct is on all the columns, not just one. — Maxim Zaslavsky, Apr 06 '10 at 04:50

score 2 · Accepted Answer · answered Apr 04 '10 at 05:27

2

Good question. Improove timeout - raise the value. Or throw in more power. Technically 400.000 rows is not a lot, so it would go very fast for a decent server. Otherwise - if you have IO problems because your "server" has only one (pair of) hard discs.... then your IO Will kill you.

Basically, DISTINCT triggers a temporary result creation in the tempdb - so it is tempdb heavy.

But at the end, I would increase timeout, especially if this is a one time operation only.

Btw., to put that into perspective - where is the HUGH table you talk about? 400.000 is trivial in size.

answered Apr 04 '10 at 05:27

TomTom

51,649
7
54
136

How do I raise the timeout value? Thanks for your answer! About the 400,000 = huge thing, it feels huge for me, but it's subjective. – Maxim Zaslavsky Apr 04 '10 at 08:27
Actually it is not subjective - pretty much standard anything below one million is trivial (not even small). 1.000.000 records where small 10 yars ago (Foxpro) ;) TImeout - check command, connection object. CHeck connection string parameters. – TomTom Apr 04 '10 at 09:15
For the timeout thing, I'm running the query directly from SQL Server Management Server. Is there a way to set it there? – Maxim Zaslavsky Apr 04 '10 at 19:21

score 0 · Answer 2 · answered Apr 04 '10 at 05:25

0

The best solution for speeding this up is to add an index on the field that you need to select against. Using a select statement where the database engine has to filter results without an index essentially requires the server to walk through every row in the table.

answered Apr 04 '10 at 05:25

Ron

639
5
2

Will not help. A distinct NOT On only the key will trigger a table scan anyway, index or not. – TomTom Apr 04 '10 at 05:28
True, I should clarify that to be effective, the index would have to be on all fields which are the target of the DISTINCT. Re-reading the question though I'm curious what the structure of this table looks like and the exact query that's being run against it. The timeout could be a symptom of a bigger issue with the overall design of the database. – Ron Apr 04 '10 at 05:43
1

Most likely a bigger issue with the hardware. Unless the table is unusualle wide - 400.000 rows are comically low a size. That is unless this is a one disc system that overloads the IO. I have distincted 130 million record tables without problems in a minute or so - 400.000 is nothing for even a small but properly set up server. – TomTom Apr 04 '10 at 06:16
1

DISTINCT is a form of aggregate... an index will help if it's not SELECT * and you can limit it to keys. Otherwise an index won't help if it has to over all columns = table scan anyway – gbn Apr 04 '10 at 07:46

Select Distinct/Unique Rows from a Huge Table in SQL Server Without Timeout Errors

2 Answers2