
I know that performance tuning is something which needs to be done specific to each environment. But I have put in maximum effort to make my question clear, to see if I am missing something in the possible improvements.

I have a table [TestExecutions] in SQL Server 2005. It has around 0.2 million records as of today, and it is expected to grow to 5 million in a couple of months.

CREATE TABLE [dbo].[TestExecutions]
(
    [TestExecutionID] [int] IDENTITY(1,1) NOT NULL,
    [OrderID] [int] NOT NULL,
    [LineItemID] [int] NOT NULL,
    [Manifest] [char](7) NOT NULL,
    [RowCompanyCD] [char](4) NOT NULL,
    [RowReferenceID] [int] NOT NULL,
    [RowReferenceValue] [char](3) NOT NULL,
    [ExecutedTime] [datetime] NOT NULL
) 

CREATE INDEX [IX_TestExecutions_OrderID] 
ON [dbo].[TestExecutions] ([OrderID]) 
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])

I have the following two queries for the same purpose (Query 2 and Query 3). With 100 records in #OrdersForRC, Query 2 works better (39% vs 47%), whereas with 10,000 records in #OrdersForRC, Query 3 works better (53% vs 33%), according to the execution plans.

In the initial few months of use, the #OrdersForRC table will have close to 100 records. It will gradually increase to 2,500 records over a couple of months.

Of the following two approaches, which one is better for such an incrementally growing scenario? Or is there any strategy to make one approach work better than the other even as the data grows?

Note: In Plan 2, the first query uses a Hash Match.

References

  1. query optimizer operator choice - nested loops vs hash match (or merge)
  2. Execution Plan Basics — Hash Match Confusion

Test Query

CREATE TABLE #OrdersForRC 
(
    OrderID INT
)

INSERT INTO #OrdersForRC
--SELECT DISTINCT TOP 100 OrderID FROM [TestExecutions]
SELECT DISTINCT TOP 5000 OrderID FROM LWManifestReceiptExecutions

--QUERY 2:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] H WITH (NOLOCK)
INNER JOIN #OrdersForRC R
    ON R.OrderID = H.OrderID

--QUERY 3:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] H WITH (NOLOCK)
WHERE OrderID IN (SELECT OrderID FROM #OrdersForRC)

DROP TABLE #OrdersForRC

Plan 1

(execution plan screenshot)

Plan 2

(execution plan screenshot)


1 Answer


As commented above, you have not specified the table definition of LWManifestReceiptExecutions or how many rows it has, and you are selecting the TOP N rows without an ORDER BY. Do you want TOP N random IDs, or a specific order, or does the order not matter to you?
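For illustration, here is the difference; this is just a sketch, and the ORDER BY column is an assumption (pick whichever column defines the order you actually need):

-- Non-deterministic: no ORDER BY, so which 5000 IDs you get is unspecified
SELECT DISTINCT TOP 5000 OrderID
FROM LWManifestReceiptExecutions

-- Deterministic: always returns the 5000 smallest distinct OrderID values
SELECT DISTINCT TOP 5000 OrderID
FROM LWManifestReceiptExecutions
ORDER BY OrderID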

If order does matter, you can create an index on the column you require in the ORDER BY. And if OrderID is unique in the [dbo].[TestExecutions] table, you should mark the index as unique: drop and recreate it as UNIQUE.

DROP INDEX [IX_TestExecutions_OrderID] ON [dbo].[TestExecutions]

CREATE UNIQUE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])

You said that the data keeps growing and will reach millions in a couple of months. No need to worry: SQL Server can easily handle these queries with a properly built schema and indexes. When this data model starts hurting, you can look at other options, but not now; I have seen people handle billions of rows in SQL Server.

I can see you are comparing the queries on the basis of query cost, and concluding that the query with the higher percentage is the more expensive one.

That is not always the case. The query cost is the aggregate subtree cost of all iterators in the query plan, and the total estimated cost of an iterator is a simple sum of its I/O and CPU components. The cost values represent expected execution times (in seconds) on a particular hardware configuration, but on modern hardware these costs may be irrelevant.
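Since those percentages are only estimates, a more reliable comparison is to measure actual I/O and CPU for both forms. A minimal sketch using the standard SET STATISTICS switches (the figures appear in the Messages tab of Management Studio):

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Run both variants and compare logical reads and elapsed time
SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.[TestExecutions] H
INNER JOIN #OrdersForRC R ON R.OrderID = H.OrderID

SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.[TestExecutions] H
WHERE H.OrderID IN (SELECT OrderID FROM #OrdersForRC)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

Logical reads and elapsed time on your own data are a much better basis for choosing between the two than the plan's cost percentages.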

Now, coming to your query: you have expressed two queries to get the same result, but they are not identical.

In Plan 1, Query 1

  • Expressed as a JOIN

The query optimizer (QO) chooses a Nested Loops join, which is a good choice for this particular scenario: for every row from #OrdersForRC, the key OrderID is used to seek into dbo.[TestExecutions], until all rows are matched.
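If you want to verify this behaviour at different row counts, you can pin the join strategy with a query hint and compare plans; OPTION (LOOP JOIN) is a standard T-SQL hint, shown here as a testing sketch only (hints are rarely a good idea in production code):

SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.[TestExecutions] H
INNER JOIN #OrdersForRC R ON R.OrderID = H.OrderID
OPTION (LOOP JOIN)  -- force Nested Loops regardless of row count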

In Plan 1, Query 2

  • Expressed with IN

The QO is doing the same thing as in Query 1, but there is an extra Distinct Sort (a Sort plus a Stream Aggregate). The reasoning behind it is that you have expressed this query with IN, and table #OrdersForRC can contain duplicate rows; the sort is necessary just to eliminate them.
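As an aside, IN here is just one spelling of a semi-join; an EXISTS form is semantically equivalent, and you could test whether the optimizer produces the same plan for it (a sketch, not a recommendation):

SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.[TestExecutions] H
WHERE EXISTS (SELECT 1 FROM #OrdersForRC R WHERE R.OrderID = H.OrderID)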

In Plan 2, Query 1

  • Expressed as a JOIN

Now that #OrdersForRC contains thousands of rows, the QO chooses a hash join over a loop join, because a loop join over that many rows costs more than a hash join. The rows are also unordered and can contain NULLs, so a HASH JOIN is the perfect strategy here.
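As with the loop join above, you can force this strategy for testing with the corresponding standard hint:

SELECT H.OrderID, H.LineItemID, H.Manifest, H.RowCompanyCD, H.RowReferenceID
FROM dbo.[TestExecutions] H
INNER JOIN #OrdersForRC R ON R.OrderID = H.OrderID
OPTION (HASH JOIN)  -- force Hash Match regardless of row count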

In Plan 2, Query 2

  • Expressed with IN

The QO has chosen a Distinct Sort for the same reason as in Plan 1, Query 2, and then a Merge Join, because the rows are now sorted on the OrderID column in both inputs.

If you just mark the temp table column as NOT NULL and UNIQUE, it is more likely that you will get the same execution plan for both the IN and the JOIN.

CREATE TABLE #OrdersForRC
(
    OrderID INT NOT NULL UNIQUE
)

(execution plan screenshot)

Further reading: Performance Strategy for growing data
