0

I have a table in SQL Server with around 400,000 rows. Unfortunately, this table has many duplicate rows, as there isn't a primary key. I'm using SELECT DISTINCT to get all unique rows, but as there's so many of them, I keep getting timeout errors. What's the best way to approach this problem?

Maxim Zaslavsky
  • 211
  • 1
  • 4
  • 13
  • 1
    belongs on stackoverflow? – gbn Apr 04 '10 at 07:44
  • You wont get a proper answer unless you post your table definition (script it out in SSMS with right-click script table as... CREATE), any indexes defined against the table and your query. Best guess if that you should create a clustered index on the column your distinct is on. – Nick Kavadias Apr 06 '10 at 02:59
  • @Nick I'll do that, but, fyi, my distinct is on all the columns, not just one. – Maxim Zaslavsky Apr 06 '10 at 04:50

2 Answers2

2

Good question. Improove timeout - raise the value. Or throw in more power. Technically 400.000 rows is not a lot, so it would go very fast for a decent server. Otherwise - if you have IO problems because your "server" has only one (pair of) hard discs.... then your IO Will kill you.

Basically, DISTINCT triggers a temporary result creation in the tempdb - so it is tempdb heavy.

But at the end, I would increase timeout, especially if this is a one time operation only.

Btw., to put that into perspective - where is the HUGH table you talk about? 400.000 is trivial in size.

TomTom
  • 51,649
  • 7
  • 54
  • 136
  • How do I raise the timeout value? Thanks for your answer! About the 400,000 = huge thing, it feels huge for me, but it's subjective. – Maxim Zaslavsky Apr 04 '10 at 08:27
  • Actually it is not subjective - pretty much standard anything below one million is trivial (not even small). 1.000.000 records where small 10 yars ago (Foxpro) ;) TImeout - check command, connection object. CHeck connection string parameters. – TomTom Apr 04 '10 at 09:15
  • For the timeout thing, I'm running the query directly from SQL Server Management Server. Is there a way to set it there? – Maxim Zaslavsky Apr 04 '10 at 19:21
0

The best solution for speeding this up is to add an index on the field that you need to select against. Using a select statement where the database engine has to filter results without an index essentially requires the server to walk through every row in the table.

Ron
  • 639
  • 5
  • 2
  • Will not help. A distinct NOT On only the key will trigger a table scan anyway, index or not. – TomTom Apr 04 '10 at 05:28
  • True, I should clarify that to be effective, the index would have to be on all fields which are the target of the DISTINCT. Re-reading the question though I'm curious what the structure of this table looks like and the exact query that's being run against it. The timeout could be a symptom of a bigger issue with the overall design of the database. – Ron Apr 04 '10 at 05:43
  • 1
    Most likely a bigger issue with the hardware. Unless the table is unusualle wide - 400.000 rows are comically low a size. That is unless this is a one disc system that overloads the IO. I have distincted 130 million record tables without problems in a minute or so - 400.000 is nothing for even a small but properly set up server. – TomTom Apr 04 '10 at 06:16
  • 1
    DISTINCT is a form of aggregate... an index will help if it's not SELECT * and you can limit it to keys. Otherwise an index won't help if it has to over all columns = table scan anyway – gbn Apr 04 '10 at 07:46