
Platform: SQL Server 2012

Background: I have two fairly large log tables, around 600k records each, that are joined on a PK/FK relationship. For the sake of argument, let's call them ReallyBigLog1 and ReallyBigLog2. The query (below) takes about 3.5 seconds to run. The WHERE clause filters on three different columns. When asked to help improve this query, I immediately noticed that the columns in the WHERE clause were not indexed. I smugly suggested adding indexes, assuming the increased performance would make me look like a hero. However, the additional indexes had no measurable effect.

Question: Given the query below, why does indexing StartTime, EndTime, and DateStamp have no measurable effect on query time?

Query

SELECT
    IrreleventField1,
    IrreleventField2,
    IrreleventField3....
FROM [dbo].[ReallyBigLog1] AS [T1]
INNER JOIN [dbo].[ReallyBigLog2] AS [T2] ON [T1].[Id] = [T2].[Id]
WHERE ([T1].[EndTime] IS NOT NULL) AND ([T1].[StartTime] IS NOT NULL) AND ([T2].[DateStamp] >= '2017-5-16 00:00:00')

Indexes

CREATE NONCLUSTERED INDEX [ix_RecommendedIndex]
ON [dbo].[ReallyBigLog1]
([StartTime] , [EndTime])

CREATE NONCLUSTERED INDEX [IX_DateStamp]
ON [dbo].[ReallyBigLog2]
([DateStamp])

Execution Plan

5 SELECT            
    4 Compute Scalar        
        3 Merge Join  / Inner Join Merge:([dbo].[ReallyBigLog1].[Id] [T2]=[dbo].[ReallyBigLog1].[Id] [T1]), Residual:([dbo].[ReallyBigLog2].[Id] as [T2].[Id]=[dbo].[ReallyBigLog1].[Id] as [T1].[Id])  
            1 Clustered Index Scan Predicate:([dbo].[ReallyBigLog1].[StartTime] as [T1].[StartTime] IS NOT NULL AND [dbo].[ReallyBigLog1].[EndTime] as [T1].[EndTime] IS NOT NULL), ORDERED FORWARD [dbo].[ReallyBigLog1].[PK_dbo.ReallyBigLog1] [T1]
            2 Clustered Index Scan Predicate:([dbo].[ReallyBigLog2].[DateStamp] as [T2].[DateStamp]>='2017-05-16 00:00:00.000'), ORDERED FORWARD [dbo].[ReallyBigLog2].[PK_dbo.ReallyBigLog2] [T2]

EDIT (Tables Composition)

SELECT
  (SELECT COUNT(*) FROM ReallyBigLog1 WHERE StartTime IS NULL) as NullStartTime,
  (SELECT COUNT(*) FROM ReallyBigLog1 WHERE EndTime IS NULL) as NullEndTime,
  (SELECT COUNT(*) FROM ReallyBigLog1) as Log1Count,
  (SELECT COUNT(*) FROM ReallyBigLog2 WHERE DateStamp > '2017-5-16 00:00:00') AS DateStampUsage,
  (SELECT COUNT(*) FROM ReallyBigLog2) AS Log2Count

DateStampUsage  Log2Count   NullStartTime   NullEndTime  Log1Count
443038          651929      33748           34144        509545
  • Optimizer most likely estimates that scanning the table is cheaper, especially since merge join needs the data to be ordered. How much of the data matches the criteria `DateStamp >= '2017-5-16'`? How much of those have either of the times null? – James Z Jun 16 '17 at 20:33
  • Thank you James, I added that info as an Edit to the question. – SteveJ Jun 16 '17 at 20:46
  • Looking at your execution plan, it looks like SQL Server is not using the new indexes at all, and hence there was no improvement. With the limited data you have provided here, there is nothing we can say for certain; this article might help you a bit: https://www.simple-talk.com/sql/performance/identifying-and-solving-index-scan-problems/ – Surendra Jun 16 '17 at 20:46
  • What specifically are your clustered/primary indexes for the two tables? – hatchet - done with SOverflow Jun 16 '17 at 21:12
  • @hatchet; the PK for both tables is Id. – SteveJ Jun 16 '17 at 21:39

3 Answers


ix_RecommendedIndex will be of little help unless you have a lot of NULLs.

Here, the indexes that really matter are the ones on the Ids and IX_DateStamp. Since you seem to have a lot of data matching the WHERE clause, the optimiser prefers a clustered index scan (to merge on the Ids).

One possibility to make it faster would be to make DateStamp the CLUSTERED index (in place of IX_DateStamp), but that will have performance side effects for other queries and should be stress-tested in a test environment first.
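
A minimal sketch of that change, assuming (as the execution plan suggests) that the primary key on Id is currently the clustered index and that nothing else depends on it staying clustered:

-- Rebuild the PK as nonclustered to free the single clustered slot.
-- Assumes the constraint name shown in the plan and no foreign keys blocking the drop.
ALTER TABLE [dbo].[ReallyBigLog2] DROP CONSTRAINT [PK_dbo.ReallyBigLog2];
ALTER TABLE [dbo].[ReallyBigLog2] ADD CONSTRAINT [PK_dbo.ReallyBigLog2]
    PRIMARY KEY NONCLUSTERED ([Id]);

-- Cluster the table on DateStamp so the date-range predicate becomes a range scan.
CREATE CLUSTERED INDEX [CIX_DateStamp] ON [dbo].[ReallyBigLog2] ([DateStamp]);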

If you can provide the execution plan with statistics, it may help in giving a better diagnosis.

Edit: With the statistics provided, I don't see how you can make it faster with indexes alone. There is simply too much data to read (more than half of both tables). You are hitting the point where you may need to consolidate your data separately, in another table, or optimize the data at the physical level (smaller record size for faster scans).

Guillaume F.
  • Thank you for taking the time to help, however, I don't understand this statement; "you may need to store counters appart". – SteveJ Jun 16 '17 at 20:54
  • Yes sorry it wasn't the best word, I basically meant you may need to consolidate your data into another table, for whatever algorithm you are trying to achieve here. – Guillaume F. Jun 16 '17 at 20:57

Since you're fetching most of the rows in the tables, the indexes have to be covering (=contain every column you need in your query from that table) to help you at all -- and that improvement might not be much.

The reason the indexes don't really help is that you're reading most of the rows, and you have the IrreleventFields in your query. Since the index contains only the index key + clustered key, the rest of the fields must be fetched from the table (=clustered index) using the clustered index key. That's called a key lookup, and it can be very costly because it has to be done for every single row found in the index that matches your search criteria.

To make the index covering, you can add the "irrelevant" fields to the INCLUDE part of the index, if you want to try whether it improves the situation.
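
For example, a covering version of the recommended index might look like this; the INCLUDE list is a placeholder for whichever ReallyBigLog1 columns the SELECT actually uses:

-- Covering variant of ix_RecommendedIndex (column list is illustrative only).
CREATE NONCLUSTERED INDEX [ix_RecommendedIndex_Covering]
ON [dbo].[ReallyBigLog1] ([StartTime], [EndTime])
INCLUDE ([IrreleventField1], [IrreleventField2], [IrreleventField3]);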

James Z
  • James, that makes perfect sense - thank you. To be certain I understand you correctly, what you are saying is that even though I have an index - since my result set is almost the entire table, and I have fields not part of the index - I have to pull in almost the entire table anyway to get those extra fields? Curious, though, how would the optimizer know that ahead of time and decide not to use the index in the execution plan? – SteveJ Jun 16 '17 at 21:02
  • James, you are not going to believe this (at least I don't), I changed my query to only select DateStamp, StartTime, and EndTime. My query time jumped up to 5.4-seconds. It is slower now. That doesn't seem possible. – SteveJ Jun 16 '17 at 21:07
  • Adding to what James says, you might try indexing ReallyBigLog1 on just StartTime, and add as included columns all the other columns from ReallyBigLog1 that you want to reference from that table. Then run your original query on that. Also, ensure you have an index on ReallyBigLog2 on the foreign Id. – hatchet - done with SOverflow Jun 16 '17 at 21:07
  • The optimizer uses statistics to estimate how many rows it will need to read from the table; that's something you can see in the query plan. You can also force the optimizer to use your index with an index hint, and compare `statistics io` output to see how it really affects things – James Z Jun 16 '17 at 21:11
  • Just a guess, but the fact that your query is slower, could be because it cannot do a merge join anymore, but looking at query plan should explain that too. – James Z Jun 16 '17 at 21:12
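
A sketch of the comparison James Z describes in the comments above, using the index name from the question; with STATISTICS IO on, the logical reads with and without the hint can be compared directly:

-- Show logical/physical reads per table for each statement that follows.
SET STATISTICS IO ON;

-- Same query as in the question, but forcing the new index on ReallyBigLog1.
SELECT
    IrreleventField1,
    IrreleventField2
FROM [dbo].[ReallyBigLog1] AS [T1] WITH (INDEX([ix_RecommendedIndex]))
INNER JOIN [dbo].[ReallyBigLog2] AS [T2] ON [T1].[Id] = [T2].[Id]
WHERE ([T1].[EndTime] IS NOT NULL) AND ([T1].[StartTime] IS NOT NULL) AND ([T2].[DateStamp] >= '2017-5-16 00:00:00');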

Having an index on the date and time columns alone is not going to help as much. You should have an index that also covers the conditions on your joins, such as the Id columns. Since your query primarily filters on the timestamp of the T2 alias, I would offer the following indexes

table           index
ReallyBigLog2   (DateStamp, Id)
ReallyBigLog1   (Id, EndTime, StartTime)
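
Expressed as index definitions (a sketch; the index names are placeholders, the key columns follow the table above):

CREATE NONCLUSTERED INDEX [IX_ReallyBigLog2_DateStamp_Id]
ON [dbo].[ReallyBigLog2] ([DateStamp], [Id]);

CREATE NONCLUSTERED INDEX [IX_ReallyBigLog1_Id_Times]
ON [dbo].[ReallyBigLog1] ([Id], [EndTime], [StartTime]);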

And here is why. You are specifically looking for transactions in T2 after a given date, so the ReallyBigLog2 index STARTS with that as its leading column. It then ALSO includes the Id column as the basis for the JOIN to log table 1. Both parts of the comparison are covered by the index and do not yet require going to the data pages to get the fields.

Now, the index for T1. It starts with Id, giving an immediate found-or-not match against the T2 table. Having EndTime and StartTime as part of the index means that, again, it does not have to go to the raw data pages to qualify the WHERE / JOIN criteria.

Once that is all done, it has the set of qualifying records, goes to the data pages for just those, and pulls the rest of the details you need.

from
   [dbo].[ReallyBigLog2] AS [T2]
      JOIN [dbo].[ReallyBigLog1] AS [T1]
         ON [T1].[Id] = [T2].[Id]
         AND ([T1].[EndTime] IS NOT NULL) 
         AND ([T1].[StartTime] IS NOT NULL) 
where
   [T2].[DateStamp] >= '2017-5-16 00:00:00'
DRapp
  • Thank you for the help - so, do I need to add ID into the compound index since it is already a PK? – SteveJ Jun 16 '17 at 21:39
  • @SteveJ, yes. Even though Id is the PK, only one index is used at a time for the query's WHERE/JOIN conditions. This way the Id and date parts being considered are covered by the index. – DRapp Jun 16 '17 at 21:44