Sql Query Pervious Row Optimisation

Question

Here is my table structure

MyTable
-----------

ObjectID int (Identity),           -- Primary Key
FileName varchar(10),
CreatedDate datetime
...........
...........
...........

I need to get the time taken to create record in a file... ie... Time elapsed between the previous record in the same file and the current record of the same file

ie... If the records are

ObjectID    FileName    CreatedDate (just showing the time part here)
--------    --------    -----------
1           ABC         10:23
2           ABC         10:25
3           DEF         10:26
4           ABC         10:30
5           DEF         10:31
6           DEF         10:35

The required output is...

ObjectID    FileName    CreatedDate     PrevRowCreatedDate
--------    --------    ----------- ---------------
1           ABC         10:23           NULL
2           ABC         10:25           10:23
3           DEF         10:26           NULL
4           ABC         10:30           10:25
5           DEF         10:31           10:26
6           DEF         10:35           10:31

So far I got this query, but it is taking a very long time than expected... Is there a better way to do it...

    Select  A.ObjectID, 
        A.FileName
        A.CreatedDate as CreatedDate, 
        B.PrevRowCreatedDate,
        datediff("SS", '1900-01-01 00:00:00', Coalesce((A.CreatedDate - B.PrevRowCreatedDate),0)) as secondsTaken
    from MyTable as A 
        Cross Apply (       
        (Select PrevRowCreatedDate = Max(CreatedDate) from MyTable as BB 
                        where   BB.FileName = A.FileName and 
                                BB.CreatedDate < A.CreatedDate
        )
        ) as B

Please let me know incase you need more information

Thanks

Are you doing this for one row at a time, or for all rows in the table? Are you adding this as a new column, including it as a column in a returned set, or something else? Is the table small, large, or immense (in terms of kb/mb/gb)? — Philip Kelley, Aug 06 '10 at 14:28
Sucks that SQL Server doesn't yet support LEAD and LAG, especially now that they're ANSI... :/ — OMG Ponies, Aug 06 '10 at 14:35
@Martin, Yes the colums are indexed, If the query requires any other index, I can do that too... — The King, Aug 06 '10 at 14:52
@Philip, I need to calculate the time taken to process a particular file... But the catch is, if the time taken between two rows is more than 20 Mins, that has to be excluded from the calculation... (Strange request, I agree) — The King, Aug 06 '10 at 14:54
@The King - A composite index or you mean they are individually indexed? — Martin Smith, Aug 06 '10 at 14:55
@Philip. The table at present contains 150k records and is expanding @50k recs per month. (We are planning to archieve it after some time)... Further I have calculate the time component for any given date range... — The King, Aug 06 '10 at 18:02
@The King - I'd create this index `CREATE UNIQUE NONCLUSTERED INDEX [IX_MyTable] ON MyTable ([FILENAME] ASC,[CreatedDate] ASC)` and use Michael's solution. — Martin Smith, Aug 06 '10 at 18:11
Sure. Sorry, Now I'm at home, I will try this first thing at work on monday and will report the result. Thanks for all your help. — The King, Aug 06 '10 at 18:21

score 3 · Accepted Answer · edited Aug 06 '10 at 19:43

3

SELECT t1.FileName, t1.CreatedDate, t2.CreatedDate as PrevCreatedDate
FROM 
   (SELECT FileName, CreateDate,
          ROW_NUMBER() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
   FROM MyTable) t1
LEFT JOIN
   (SELECT FileName, CreateDate,
     ROW_NUMBER() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
     FROM MyTable) t2
ON (t1.FileName = t2.FileName AND t1.OrderNo = t2.OrderNo - 1)

Or may be better use 'WITH', because queries is identical:

WITH t(ObjectID, FileName, CreatedDate, OrderNo) AS
   (SELECT ObjectID, FileName, CreatedDate,
          ROW_NUMBER() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
   FROM MyTable) 
SELECT t1.ObjectID, t1.FileName, t1.CreatedDate, t2.CreatedDate AS PrevCreatedDate,
        DATEDIFF("SS", '1900-01-01 00:00:00', 
           COALESCE((t1.CreatedDate - t2.CreatedDate),0)) AS secondsTaken
FROM t t1 LEFT JOIN t t2 
ON (t1.FileName = t2.FileName AND t1.OrderNo = t2.OrderNo + 1)

edited Aug 06 '10 at 19:43

Martin Smith

438,706
87
741
845

answered Aug 06 '10 at 14:28

Michael Pakhantsov

24,855
6
60
59

Worth trying this and looking at the plans but I'm not sure this will be any more efficient than what the OP already has. – Martin Smith Aug 06 '10 at 14:32
@Martin, unfortunately mssql have only several windowed functions. in oracle same query will look more elegantly: SELECT FileName, CreateDate, LAG(CreatedDate, 1, NULL) OVER(PARTITION BY FileName ORDER BY CreatedDate) AS PrevCreatedDate FROM MyTable. But LAG does not exists in mssql – Michael Pakhantsov Aug 06 '10 at 14:45
... though when I added a composite index on `FileName, CreatedDate` they both came out at the same cost. I'll do a bit more testing. – Martin Smith Aug 06 '10 at 15:14
@Martin Smith - I would imagine the amount of rows to process using the CROSS APPLY would quickly skyrocket. With given ROW_NUMBER solution and proper indexing, the amount of rows processed would roughly be n*2. For larger tables, I would think the ROW_NUMBER solution performs much better. OP Should try it (and post a query plan). – Lieven Keersmaekers Aug 06 '10 at 15:17
@Michael - That was the first one in that image. When I added the composite index on `FileName, CreatedDate` and a few thousands rows of test data though yours looks like it might be the slight winner though it's very close http://img715.imageshack.us/img715/3774/plansb.jpg – Martin Smith Aug 06 '10 at 15:31
@Lieven - Sorry I'm coming to realise that for some reason the figure that SQL Server reports as the "query cost (relative to the batch)" for these `row_number` queries is orders of magnitude different from the actual cost. For some reason it seems unable to realise that the join on rownum=rownum+1 will yield one row and this puts its cost estimates way off the actual. When I look at actual reads, writes, and cpu your's and Michael's are about the same and both massively better than the OPs original query. – Martin Smith Aug 06 '10 at 16:39
@Michael - I just made a couple of minor edits that I noticed when I was testing the two queries (I think the `-` should have been a `+` and I added in the additional column). Hope that's OK. – Martin Smith Aug 06 '10 at 18:16

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

I think Michael's answer should indeed prove more efficient. When evaluating efficiency though I just want to draw attention to an issue with the query costs (relative to the batch) shown in Management Studio.

I set up a test table with 23,174 rows and ran the query in the question and Michael's. Looking at the "query cost (relative to the batch)" in the actual execution plan the original query was 1% and Michael's 99% cost and so appears to be massively inefficient.

However the actual stats tell a completely different story

Cross Apply Approach

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
Table 'MyTable'. Scan count 23175, logical reads 49335, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

ROW_NUMBER approach

SQL Server Execution Times: CPU time = 391 ms, elapsed time = 417 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'MyTable'. Scan count 2, logical reads 148, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

In the ROW_NUMBER plan the Merge Join on rownumber=rownumber+1 has 23,174 rows going in to both sides. This value is unique and actual rows out is 23,174 as well. However SQL Server estimates that the rows produced from that join will be 34,812,000 and thus its estimated cost for the insert later in the plan is wildly inaccurate.

Test Script

BEGIN TRAN

CREATE TABLE MyTable
  (
     [ObjectID]    [INT] IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
     [FileName]    [VARCHAR](50) NULL,
     [CreatedDate] [DATETIME] NULL
  )

GO

INSERT INTO MyTable
SELECT ISNULL(type, NEWID()),
       DATEADD(DAY, CAST(RAND(CAST(NEWID() AS VARBINARY)) * 10000 AS INT), GETDATE())
FROM   master.dbo.spt_values,
       (SELECT TOP 10 1 AS X FROM  master.dbo.spt_values) V


DELETE FROM MyTable
WHERE  EXISTS(SELECT *
              FROM   MyTable m2
              WHERE  MyTable.CreatedDate = m2.CreatedDate
                     AND MyTable.FileName = m2.FileName
                     AND MyTable.ObjectID < m2.ObjectID)

CREATE UNIQUE NONCLUSTERED INDEX [IX_MyTable]
  ON MyTable ([FileName] ASC, [CreatedDate] ASC)

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT A.ObjectID,
       A.FileName,
       A.CreatedDate                                                                                AS CreatedDate,
       B.PrevRowCreatedDate,
       DATEDIFF("SS", '1900-01-01 00:00:00', COALESCE(( A.CreatedDate - B.PrevRowCreatedDate ), 0)) AS secondsTaken
INTO   #A
FROM   MyTable AS A
       CROSS APPLY ((SELECT PrevRowCreatedDate = MAX(CreatedDate)
                     FROM   MyTable AS BB
                     WHERE  BB.FileName = A.FileName
                            AND BB.CreatedDate < A.CreatedDate)) AS B;

WITH t(ObjectID, FileName, CreatedDate, OrderNo)
     AS (SELECT ObjectID,
                FileName,
                CreatedDate,
                RANK() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
         FROM   MyTable)
SELECT t1.ObjectID,
       t1.FileName,
       t1.CreatedDate,
       t2.CreatedDate                                                                          AS PrevCreatedDate,
       DATEDIFF("SS", '1900-01-01 00:00:00', COALESCE(( t1.CreatedDate - t2.CreatedDate ), 0)) AS secondsTaken
INTO   #B
FROM   t t1
       LEFT JOIN t t2
         ON ( t1.FileName = t2.FileName
              AND t1.OrderNo = t2.OrderNo + 1 )

/*Test the 2 queries give the same result*/
SELECT *
FROM   #A
EXCEPT
SELECT *
FROM   #B

SELECT *
FROM   #B
EXCEPT
SELECT *
FROM   #A

ROLLBACK

Sql Query Pervious Row Optimisation

2 Answers2

Cross Apply Approach

ROW_NUMBER approach

Test Script

Linked