3

I have a query with an ORDER BY clause that is slow due to the table having over 11 million rows.

I can speed it up dramatically by adding a clustered index on the column in the ORDER BY clause. However, the software creates the query to order by different columns, depending on user settings. And you cannot add more than one clustered index to a table.

My question is: can non-clustered indexes be used to improve ORDER BY performance? Or is there something special about clustered indexes that means I will not be able to sort quickly for all columns?

Note: I've posted my real query and execution plan online but there are other issues that I don't want to go into here. I didn't create the database or write the query. And the query is still very slow even without the IN clause.

Jonathan Wood
  • 65,341
  • 71
  • 269
  • 466
  • @RonenAriely: Without going into a big, long thing, is there any reason why we can't talk about general rules for when an index can improve sort performance? (I asked a related question with a simplified query and was prompted to post my real query, which had a lot going on, and then I had to deal with discussions about parts of the original query that had nothing to do with my issue. Why can't we talk about indexes and sort performance?) – Jonathan Wood Aug 24 '18 at 19:37
  • 2
    Because there is no rule that says "in situation X a non-clustered index will improve performance". There are literally books and books on indexes and how they relate to performance. A good indexing strategy is about 60% technical prowess and 40% art gained from both experience and testing. – Sean Lange Aug 24 '18 at 19:41
  • What do you think is a clustered index is, that you give us link to show that there can be only one. This is a basic knowledge. And what do you think is a non-clustered index if you not sure that it can be used to improve performance of sorting (in specific cases)? I recommend to find a nice online tutorial and read about indexes and why we use them. In short: clustered index is the physical order of the rows in the table (therefore how can you have more then one?!?). A non-clustered index is a separate element from the tale that stores data (for example in tree structure). – Ronen Ariely Aug 24 '18 at 19:44
  • Check this doc as a starting point: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described – Ronen Ariely Aug 24 '18 at 19:44
  • @RonenAriely: I'm familiar with the limitation of only one clustered index, that's why I said only one can be added in my question. I have been looking at that link, but there still seems to be a hole in part of my understanding. – Jonathan Wood Aug 24 '18 at 19:46
  • "`general rules for when an index can improve sort performance`" : well we can discuss golden rules but these might work and might not work, which is the meaning of golden rules :-) . It is much simpler to understand what is Nonclustered index and what is Clustered index which is very basic topic and it is a must for anyone who work with databases. Can you try to focus on specific question? I am not sure what is not clear at this point ;-) – Ronen Ariely Aug 24 '18 at 19:48
  • CAN non-clustered indices be used to improve `Order By` performance? Yes. WILL non-clustered indices improve `Order By` performance *in your particular case*? There's no way to know with the limited information that you've provided. – Brian Aug 24 '18 at 19:52
  • @RonenAriely: First off, my real expertise is C/C++/C# and the like. I just do what I need in SQL. Second, I posted a related question and it was suggested that I add a clustered index to speed up the sort. But now my question is how to do the same thing if I might need to sort of one of several different columns. That's the issue. And I added a link to my original query and execution plan. Please don't bring up issues with the `IN` clause. – Jonathan Wood Aug 24 '18 at 19:53
  • @Brian: Now that comment was posted *after* I added the link to my original query and execution plan. – Jonathan Wood Aug 24 '18 at 19:54
  • `First off, my real expertise is C/C++/C#` : OK, this make sense :-) but a database like you described usually managed by DBA. Anyhow, lets try to help you in the database side – Ronen Ariely Aug 24 '18 at 20:01
  • Perhaps a compound nonclustered index would help on `(lprOwner, lprDate`)? And a similar index for `(lprOwner, X`) where X is each column this query may sort on. Keep in mind indexes are expensive to store and maintain, so if users can sort on 2 or 3 columns, fine, but if they can sort on *any* column on the table, this may be a bad solution. – Brandon Aug 24 '18 at 20:22

3 Answers3

8

Non-clustered indexes can absolutely be used to optimize away a sort. Indexes are essentially binary search trees, which means they contain the values sorted in order.

However, depending on the query, you may be putting SQL Server into a conundrum.

If you have a table with 100 million rows, your query will match 11 million of them, like below, is it cheaper to use an index on category to select the rows and sort the results by name, or to read all 100 million rows out of the index pre-sorted by name, and then filter down 89 million of them by checking the category?

select ...
from product
where category = ?
order by name;

In theory, SQL Server may be able to use an index on name to read rows in order and use the index on category to filter efficiently? I'm skeptical. I have rarely seen SQL Server use multiple indexes to access the same table in the same query (assuming a single table selection, ignoring joins or recursive CTE's). It would have to check the index 100 million times. Indexes have a high overhead cost per index search, so they are efficient when a single search narrows down the result set by a lot.

Without seeing a schema, statistics, and exact query, it's hard for me to say what makes sense, but I expect I would find SQL Server would use an index for the where clause and sort the results, ignoring an index on the sort column.

An index on the sort column may be used if you are selecting the entire table though. Like select ... from product order by name;

Again, your milage may vary. This is speculation based off past experience.

Brandon
  • 9,822
  • 3
  • 27
  • 37
  • Thanks for the complete explanation. You can view my query and execution plan [online](https://www.brentozar.com/pastetheplan/?id=Skjuu8sLX), but I don't really want to get into the `IN` clause as it's still very slow if I take it out. – Jonathan Wood Aug 24 '18 at 19:45
  • OK, now we have something to discuss :-) but we still don't have the table structure :-( . Let's discuss your sample a bit: (1) You can see that your query uses a condition "`lprOwner IN `" which mean that for this condition an index might help a lot since it might be used to filter the data faster (without having your table structure and testing the query with indexes we cannot be sure since the data itself in the table can influence the Execution Plan! for small table the server might choose to scan the table and not to use the index, while in another table it might use the index – Ronen Ariely Aug 24 '18 at 19:56
  • @RonenAriely: If I add a clustered index, the query is nearly instantaneous, even with the `IN` clause. So why do I need to make the `IN` clause faster? (And this is why I was reluctant to post my actually query, and why I keep saying I didn't want to get into the `IN` clause.) – Jonathan Wood Aug 24 '18 at 20:03
  • I don't know if you need :-) I only said that in some cases it can help. I meant to continue with more points and I only started from the first :-). Anyhow, In general I highly recommend not to follow the automatic recommendation you get from tool. You should check and test each option. Try to CREATE NONCLUSTERED INDEX as the message recommend you and compare the performance by comparing Execution Plans, Checking the Statistics IO and statistics Time with and without the index, and fir last step you can create several table and compare real execution on all in the same time. – Ronen Ariely Aug 24 '18 at 20:06
  • I am sry, I cannot work in this interface and guide you with million short comments without formatted code... I recommend to find a nice tutorial (I really tried but I it is too annoying to split the comments and in the middle people change their responses and my comment become irrelevant so I started again... this is not a good place to teach) – Ronen Ariely Aug 24 '18 at 20:09
  • The problem with this argument is you are asking for theoretical advice, but then dismissing advice by asserting it does not apply in your case. Taking the `IN` clause away potentially changes the situation as SQL Server has different options as its disposal. See my answer above. Is it cheaper to filter with an index and sort, or scan an index to avoid a sort, and then filter? Well, if you take the `where` clause out, there is no filtering. It changes your situation. – Brandon Aug 24 '18 at 20:10
  • 1
    "`If I add a clustered index`" again, clustered index is the table! if you added clustered index then your data is already sorted, which is why you get fast result. There is no sorting in case that you read all the data by the order of the clustered index. But a table stores the data once which mean it can store it only on one order, which mean that for this query maybe specific clustered index fit but what about your next query?!? You should be familiar with all your system before you design you table structure. and remember that adding non-clustered index means that you store more data – Ronen Ariely Aug 24 '18 at 20:15
  • @Brandon: I'm not sure what I said does not apply in my case. I understand what you are saying about how the `IN` clause can affect which indexes can be used for sorting. I'm just saying I added a clustered index and the query is nearly instantaneous. I felt I found a solution until I learned that the code sometimes had to sort by other columns. So I'm just trying to understand how I might approach that difference. – Jonathan Wood Aug 24 '18 at 20:21
  • Ah, the clustered index helps because the path to select the rows in your execution plan is a table scan. A clustered index really just means your table is stored ordered by `lprDate`, hence SQL Server can skip the sort. – Brandon Aug 24 '18 at 20:23
  • To carry the thought through, you can always simulate a clustered index's benefits by creating a nonclustered index for each column, but you would have to include all of the columns your queries select and filter on. Your query would be blazing fast, but you would basically be storing a copy of the table for each column it has, so 23 copies of the table. Space versus time tradeoff and read versus write tradeoff. (I'm not suggesting you do that - just educating you on how it works) – Brandon Aug 24 '18 at 20:25
  • @RonenAriely: You know, stackoverflow is [trying to encourage people to be more helpful instead of putting down people for what they don't know](https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/). I appreciate any help you offered but I'm not stupid, I've created a lot of useful software and I don't currently have access to a DBA. I'm just looking for a little help and this is not the appropriate forum for extended debates about semantics, links I post, how strange you think my question is. – Jonathan Wood Aug 24 '18 at 20:28
  • @Brandon: Yes, that's exactly what I was getting at. If I added a non-clustered index for each column I may need to sort by, can that benefit performance as much as a clustered index? I understand that slows down updates, but I'll just need to evaluate that. – Jonathan Wood Aug 24 '18 at 20:30
  • Yes, it will. But you most likely need to include every column that your query references in that index. If you do that, you will maximize the query performance for all sorts, at the cost of 23x more disk space and 23x slower inserts/updates/deletes. – Brandon Aug 24 '18 at 20:34
  • "`if I added a non-clustered index for each column`": Each non-clustered index that you add is stored separately from the table data (more data to read and to store). Each non-clustered index that you add needed to be managed, which means cost of performance - In general the **golden rule** is that non-clustered index helps for SELECT queries (if this is the right index to use), but at the same time it reduces the performance of UPDATE, INSERT, DELETE queries since we need to manage the non-clustered index in addition to the data in the table. Moreover, the server has tasks behind the scene... – Ronen Ariely Aug 24 '18 at 20:43
0

Regarding @SeanLange comment on indexes being an art rather than a science, the best foo bar I have seen is where all columns of table were in the primary key. Further, if you are not careful and just create indexes based off every query execution plan, you will probably end up storing more data in indexes than what is on the actual table.

The idea here is to use covered queries. For your case, I've seen clustered indexes that are on an identity field where a non-clustered index contains the primary key (usually a composite primary key) that includes the clustered index column. From there, SELECT based on the primary key and order on the clustered index (its already sorted).

Update:

I just saw the query execution plan. You are getting hit with a table scan, which means that none of the columns in the WHERE clause are contained in either the primary key or an index. As far as the optimizer is concerned, the table is running in heap. Therefore, any index you add that contains (i.e., covers) the columns that are contained within the WHERE clause are likely to be used. As a result, the query will return much faster.

Ideally, you want to see Index Seeks followed by Index Scans. Usually, the optimizer will look for the unique identifier by its ordinal position in the index. What this means is that if the identity column is the first column listed in the index then you should be rewarded with an index seek. If the first column in the index is non-unique, then you will get an index scan. I would not say these are hard and fast rules, but that is my understanding based on the literature I have read and execution plans that I have seen.

J Weezy
  • 3,507
  • 3
  • 32
  • 88
  • Your advice of having an index to cover all queries is how you get the problem you described in the first paragraph, tons of indexes taking up more space than the table itself. You need to balance having a few generic, reusable, but useful indexes with focused special-purpose indexes to cover every single query. – Brandon Aug 24 '18 at 21:00
  • @Brandon How so? I mentioned using the primary key columns only. My example of foo bar was all columns in the table being in the primary key. – J Weezy Aug 24 '18 at 21:04
  • @JWeezy: As I described, the software needs to sort on one of many columns, based on user settings. So I don't see how this address that. Also, I have no idea what *SELECT based on the primary key* means. – Jonathan Wood Aug 28 '18 at 15:30
  • @JonathanWood "SELECT based on primary key" is another way of saying "the primary key is in the WHERE clause" - the optimizer will use the index instead of a table scan (ugly). If you are allowing for user settings then I is there an application interface that the user is using? If yes, why not retrieve the results and let the application sort the data? If no, have you considered using a view? I'm not sure that will help. Allowing for the user to sort by various columns may require an index for each sort combination (ugly). You may want to consider limiting the user's ability to sort. – J Weezy Aug 28 '18 at 15:41
  • @JWeezy: If you read my question, you know that the table has over 11 million rows. Are you asking me why I don't retrieve all rows from the database and then sort them in my application? Sorry, but that's a terrible approach. As far as selecting on a primary key, the software has filter settings that determine what the `WHERE` clause looks like. It is not limited to a single column. – Jonathan Wood Aug 28 '18 at 16:03
  • @JonathanWood I don't know your data model. But it seems like you can use a query with a where clause to massively reduce the number of rows that are selected; deliver that to the application; and then let the application sort the reduced data set. Once in memory, they can order by any column sets they want without loading the server with more requests. You should work with your users to identify the columns that matter most when ordering that matters most to then in order to build the correct index(s). Ideally, you will be able to rationalize those and then build a view based off that. – J Weezy Aug 28 '18 at 16:32
  • @JWeezy: Have you never used software that lets you sort on any column you want? Which of the hundreds of users should I ask which sort columns they are willing to give up? And the problem with limiting the rows returned from the database, sometimes the user's filter result in millions of rows. So yes we can page the data and just retrieve one page at a time, but *we can't do that before we sort the data!* How can we know which rows go on the first page if it isn't already sorted? – Jonathan Wood Aug 28 '18 at 16:41
  • @JonathanWood I do not know what your application is (web UI?). You can deliver the data to the application layer and sort it with the code before displaying it to the user. Additionally, you can partition the table, but all SELECTS should use the partition column, this will make results return much faster. I am offering you alternatives and I caution allowing your uses to be able to return millions of records only to have to re-sort them. – J Weezy Aug 28 '18 at 16:50
  • @JWeezy: You completely ignored my questions. If I don't know which rows go on the first page because the data isn't already sorted, then I cannot retrieve only that limited number of rows. You just repeated the statement from your previous comment. And it makes no difference what my UI is. – Jonathan Wood Aug 28 '18 at 16:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178950/discussion-between-j-weezy-and-jonathan-wood). – J Weezy Aug 28 '18 at 16:52
0

I would guess that the best solution to this problem would be to

  1. Create a surrogate key as the cluster index
  2. Create composite indices for your queries

For example, you have

SELECT a,b,c
FROM tbl
WHERE x=?,y=?,z=?
ORDER BY j,k,l DESC

Then, you create composite index

INDEX xyz_jkl (x,y,z,j,k,l DESC)

This way, you optimize for each query.

The surrogate key is important for queries outside this table. Having it being an AUTO_INCREMENT field also makes INSERT faster.

Also keep in mind that the PRIMARY KEY (clustered index) is always included in the index.

Jiulin Teng
  • 299
  • 3
  • 8