5

I have a table (Orders) with CLUSTERED IDENTITY PRIMARY KEY (OrderId) and I am filtering and sorting the data by add date column (AddDate). Is there a way to tell the query optimizer that the AddDate is ordered the same way the OrderId is (so the data is ordered by AddDate already)?

SQL Server does not really need to Scan whole table and then sort it. All the operation requires is to Scan the table until end date is found, then filter out data before start date and return it as is (without sorting it).

Example:

SELECT
      *
    FROM Orders
    WHERE AddDate BETWEEN @FromDate AND @ToDate
    ORDER BY AddDate
  • 2
    Why not add a non-clustered index on your column `AddDate`? SQL Server doesn't have the same knowledge of the data you do. You might well know that `OrderID` and `AddDate` share an order, but your DBMS has no idea this is the case. Thus it's only option is to check every single value. Adding an index gives your DBMS this information, and thus it can make a more informed, and optimised, query plan. – Thom A Oct 02 '18 at 11:45
  • Clustered: because it needs to be on the `OrderId` since that's the primary way system accesses the table data. Non-Clustered: because it would have to contain almost all the columns of the table to be useful (double table size and update times). Otherwise it would use Key Lookup which is too slow for that many rows. – Jozef Babinsky Oct 02 '18 at 11:51
  • 2
    Unless you change your existing indexes or add a suitable one, then a table scan is the DBMS' only choice. – Thom A Oct 02 '18 at 11:53
  • 3
    You could change the PK index to non-clustered and create a clustered index on date to optimize queries that order by date and select all/most columns. Whether or not that's the best approach depends on the other queries in your workload. – Dan Guzman Oct 02 '18 at 11:55
  • @DanGuzman Overwhelming majority of data addressing uses `OrderId`, so no. – Jozef Babinsky Oct 02 '18 at 12:01
  • Another approach would be to use the `OrderId` column in the `order by` clause. True, it's less readable for a human, but will let SQL Server scan the clustered index of the table for the sort, and you can always have a comment explaining this order by clause (That's what comments are for...) – Zohar Peled Oct 09 '18 at 13:55
  • I'm guessing you're trying to reorder the data by date, so you'll make the recent data at the first page. if that the case, (and non-clustered index on AddDate is not working solution for you) you can just reverse the order of the clustered index, which will make recent AddDate dates at the top page. – iSR5 Oct 09 '18 at 15:13
  • I both sort and **filter** by `AddedDate`. – Jozef Babinsky Oct 10 '18 at 09:55
  • @Damien_The_Unbeliever I clearly wrote, that I know what is the nature of the data and am asking if there is a way to relay it to SQL Server. What is the point of your comment? Either tell me that there is something incorrect in my analysis or tell me the answer. If you have nothing to say, please, don't spam. – Jozef Babinsky Oct 10 '18 at 14:08
  • @JozefBabinsky that way is by using an index. There are many wrong assumptions in the question. The most important is that the *storage* order matters or is preserved in any way. **It isn't**. Multiple operators and parallel processing can *change* the order the results are produced. Unless you specify an `ORDER BY` results will come out in whatever way is more convenient. – Panagiotis Kanavos Oct 12 '18 at 09:29
  • @JozefBabinsky second, it's not only that there's no order in the results. There's no way to make that filter work fast. Where will the server find and filter that data? An index is a B-Tree structure. It's fast because it's a B-Tree, not because it's ordered. Even if the results were returned in date order, performance would *still* be very slow as the server would have to scan the entire table to find matches, reading the entire row each time to find that one `AddDate` field – Panagiotis Kanavos Oct 12 '18 at 09:31
  • @JozefBabinsky what are you trying to achieve? Why don't you add an *index*? How are the results of this query used? If you want to display or export a date's orders, you need that index. If you want to generate a report, perhaps a different schema, a Calendar table or a columnstore index would be useful. Please explain the real problem behind this question – Panagiotis Kanavos Oct 12 '18 at 09:37
  • @JozefBabinsky are you trying to replicate data, or reach changes since the last time a job run? There are ways to do that too, that don't involve using date fields. You could use change tracking for example to find the rows that changed since a specific database version, even deleted ones. Replication has its own mechanisms to detect and replicate modified rows – Panagiotis Kanavos Oct 12 '18 at 09:50
  • @JozefBabinsky it has everything to do with the question. The clustered index has nothing to do with the query, the date column nor does it affect how it's processed. Since order isn't preserved and there's no index to seek, the server has to scan and sort. If you can't explain how you want to use that query, the only thing one can say (and everyone else already suggested) is to add an index – Panagiotis Kanavos Oct 12 '18 at 11:00
  • @PanagiotisKanavos I understand how indices and data order works. I was talking about nature of the data, not about it being ordered in the database file. I am aware that order is not preserved, but it has nothing to do with the question. – Jozef Babinsky Oct 12 '18 at 11:02
  • @JozefBabinsky the only other suggestion (without concrete information) would be to suggest a Columnstore index. It's available in all editions since SQL Server 2016 SP1 and effectively indexes all columns. – Panagiotis Kanavos Oct 12 '18 at 11:02
  • @JozefBabinsky then I'm afraid you won't get any answer. The good answers have already been posted – Panagiotis Kanavos Oct 12 '18 at 11:04
  • @PanagiotisKanavos There is no more concrete information. The example query is pretty clear :). "No there is no way to do it" is perfectly valid answer. – Jozef Babinsky Oct 12 '18 at 11:05
  • The use of "SELECT *" might be forcing you r query to always use the clustered index. You need to also include what other indexes are defined on the table and also what columns do you really have to return./ If you do need to return all the columns then I don't see any improvements that can be made but if it is a subset then a non-clustered index will be your best friend. – Martin Cairney Oct 16 '18 at 04:28

3 Answers3

4

Is there a way to tell the query optimizer that the AddDate is ordered the same way the OrderId is (so the data is ordered by AddDate already)?

No, there isn't any way to do that.

However you can order by OrderId instead of by AddDate, if AddDate is ordered the same way as OrderId, it would return the same results. But unfortunatly, SQL Server would scan the entire table anyway.

Let't take Northwind Orders tables and OrderDate column.

The query:

SELECT *
FROM dbo.Orders
WHERE OrderDate BETWEEN '1997-12-10' AND '1998-03-05'
ORDER BY OrderDate

Produces this plan. It scans the clustered index entirely while applying a filter, then order the result.

The query:

SELECT *
FROM dbo.Orders
WHERE OrderDate BETWEEN '1997-12-10' AND '1997-12-17'
ORDER BY OrderId -- It's equivalent to ordering by OrderDate

Produces this plan. It also scans the clustered index entirely and apply a filter, but it doesn't order.

Having OrderDate as the clustered index key would drastically improve the performance of the query, but you might not want to have such a clustered index key. However you cloud create a covering index that would also drastically improve the performance:

CREATE INDEX IX_Orders_OrderDate ON dbo.Orders(OrderDate)
INCLUDE ([OrderID], [CustomerID], [EmployeeID], [RequiredDate], [ShippedDate], [ShipVia], [Freight], [ShipName], [ShipAddress], [ShipCity], [ShipRegion], [ShipPostalCode], [ShipCountry])

The query:

SELECT *
FROM dbo.Orders
WHERE OrderDate BETWEEN '1997-12-10' AND '1998-03-05'
ORDER BY OrderDate

Produces this plan. It just seeks the index. It cannot be faster.

But that index is fat, it would penalize data modifications.

However, you could take advantage of a thinner index like the following:

CREATE INDEX IX_Orders_OrderDate ON dbo.Orders(OrderDate, OrderId)

Using a query like this:

DECLARE @FromOrderId int, @ToOrderId int;
SELECT TOP (1) @FromOrderId = OrderId FROM dbo.Orders WHERE OrderDate <= '1997-12-10' ORDER BY OrderDate DESC, OrderId DESC;
SELECT TOP (1) @ToOrderId = OrderId FROM dbo.Orders WHERE OrderDate >= '1998-03-05' ORDER BY OrderDate ASC, OrderId ASC;

SELECT *
FROM dbo.Orders
WHERE 
    (OrderId  >= @FromOrderId OR @FromOrderId IS NULL)
    AND (OrderId <= @ToOrderId OR @ToOrderId IS NULL)
ORDER BY OrderID
OPTION (RECOMPILE)

It produces this plan. It just need 3 seeks to solve the query.

Jesús López
  • 8,338
  • 7
  • 40
  • 66
  • your plans may be tainted by the fact you are using "SELECT *" in your queries. [This atricle](https://use-the-index-luke.com/blog/2013-08/its-not-about-the-star-stupid) gives a decent initial discussion on why. Without knowing the full set of indexes on the table it isn't easy to define what will work better, but as the sample query has SELECT * also then it is almost always going to use the clustered index – Martin Cairney Oct 16 '18 at 04:25
  • @MartinCairney I used SELECT * just because I don't know which columns are needed. What I can do then?. Incidentally, the third query doesn't use the clustered index because the index includes every column in the table, it is like a second clustered index. I know SELECT * is evil, but I don't believe on absolute sentences like "never ever do that because it's evil" – Jesús López Oct 16 '18 at 08:40
  • OR [@FromOrderId] IS NULL is not require because it is guarantee that [@FromOrderId] will not null. Hence no need of OPTION (RECOMPILE) Also why you will use order by on 2 column ,ORDER BY OrderDate DESC, OrderId DESC.Only ORDER BY OrderDate DESC on 2 column will do. – KumarHarsh Oct 16 '18 at 10:11
  • IMO, I think overall performance matter, which only testing can reveal .So may be your second index is enough. Also even If I go with your third option then third option query is ok but third option index is not require.Second option index will work in both cases. – KumarHarsh Oct 16 '18 at 10:25
  • @KumarHarsh `@FromOrderId `will be NULL if there are no orders older than the especified `OrderDate`, so `OR [@FromOrderId] IS NULL` is required and `OPTION (RECOMPILE)`. I need both `ORDER BY OrderDate DESC, OrderId DESC` because there can be more than one order with the same `OrderDate`, including both will ensure I pick the correct `OrderId`. – Jesús López Oct 16 '18 at 13:16
  • @KumarHarsh, my second index might be enough, but it would produce key lookups, which are expensive. And worse, imagine the first time you pass a tight interval, SQL Server, decides to use the index, that's ok, but later you pass a wide interval, SQL Server uses the cached plan, Boom! the query takes a lot because it performs lots of key lookups. – Jesús López Oct 16 '18 at 13:22
  • Suppose [@FromOrderId IS NULL] then you can add condition and populate [@FromOrderId] with min value of orderid ,as we should avoid OPTION (RECOMPILE). as far as possible.I mean your second index and last query,it won't produce key loop up.Anyway it will depend upon trying what we discuss and follow the most performant way.Thanks. – KumarHarsh Oct 17 '18 at 03:27
  • 1
    @KumarHarsh. Performance is the question, yes. So what's the point? the last query is solved very efficiently, with just 3 seeks, is it not fast enough for you?. I use OPTION (RECOMPILE) only when worth it, like in this case. The second index, if used (in the first or second query), would produce key lookups ended, it doesn't contains all field listed on the statement, so SQL Server needs to pick them form the clustered index. – Jesús López Oct 17 '18 at 17:15
2

There is no more you can do. First of all if you know that this is sorted the same way then there is no reason why would you order your result set by AddDate instead of OrderId. Secondly if you know about this relation you can just get the OrderId assigment for the date frame you are after as @Jesus showed you. Best method will be to create additional index on OrderDate (only). For sake of such a simple query this will work best, no need to include extra columns, PK will be covered in the background anyway and based of in the whole filtering will be applied. Last option you have which still will bring some improvements (not as good as index, but still can be helpful) is to add extra statistics on the OrderDate column, based on them the SQL Server kardionality estimator will be able to produce better plan wchich should drive to better performance.

Bartosz X
  • 2,620
  • 24
  • 36
1

You could add a non clustered index on AddDate, OrderId:

CREATE INDEX IX_Orders_AddDate_OrderID
ON dbo.Orders(AddDate, OrderID)

Then rewrite your query:

SELECT       *
FROM Orders
WHERE OrderId >=
           (SELECT MIN(OrderId)
            FROM dbo.Orders
            WHERE AddDate >= @FromDate) AND
      OrderId <=
           (SELECT MAX(OrderId)
            FROM dbo.Orders
            WHERE AddDate <= @ToDate)
ORDER BY AddDate
Steve Ford
  • 7,433
  • 19
  • 40