2

The data I am working with

Consider the following 2 database tables:

CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Contact_UID] [uniqueidentifier] NOT NULL CONSTRAINT [DF_Contact_Contact_UID]  DEFAULT (newsequentialid()),
    [Name] [nvarchar](255) NOT NULL,
    [ContactStatus] [nvarchar](255) NOT NULL)

CREATE TABLE [dbo].[Contact_Log](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [LogDate] [datetimeoffset](7) NOT NULL CONSTRAINT [DF_Contact_Log_LogDate]  DEFAULT (sysdatetimeoffset()),
    [Contact_UID] [uniqueidentifier] NOT NULL CONSTRAINT [DF_Contact_Log_Contact_UID]  DEFAULT (newsequentialid()),
    [Name] [nvarchar](255) NOT NULL,
    [ContactStatus] [nvarchar](255) NOT NULL)

The Contact table is the main table for contact records. It stores the name of the contact and the status (e.g. "Alive", "Dead" or whatever).

The Contact_Log table stores all changes made to the Contact table.

So here is some sample data:

Contact:

+----+--------------------------------------+------+---------------+
| ID | Contact_UID                          | Name | ContactStatus |
+----+--------------------------------------+------+---------------+
| 1  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Dead          |
+----+--------------------------------------+------+---------------+
| 2  | F7844037-2FF5-47B9-874D-C0920E7DC092 | Jane | Alive         |
+----+--------------------------------------+------+---------------+

Contact_Log:

+----+--------------------------------------+------+---------------+------------+
| ID | Contact_UID                          | Name | ContactStatus | LogDate    |
+----+--------------------------------------+------+---------------+------------+
| 1  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Alive         | 2019-01-01 |
+----+--------------------------------------+------+---------------+------------+
| 2  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Dead          | 2019-01-02 |
+----+--------------------------------------+------+---------------+------------+
| 3  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Dead          | 2019-01-03 |
+----+--------------------------------------+------+---------------+------------+
| 4  | F7844037-2FF5-47B9-874D-C0920E7DC092 | Jane | Alive         | 2019-01-04 |
+----+--------------------------------------+------+---------------+------------+

NOTE: At this point I haven't added any indexes or anything like that on that tables.

Test Scenario

The above is just some sample data. The data I am testing on has the following row counts:

Contact: ~10,000 rows

Contact_Log: ~3,000,000 rows

I am currently using SQL Server 2008 R2 for testing. So a solution that is supported in that and later is preferred.

What I am trying to achieve

Basically I am trying to formulate a query that can tell me the LogDate for when the ContactStatus field was last changed, for a specific Contact_UID, taken from the Contact_Log table.

For example, if the record I was interested in was "John", then the result should be "2019-01-02". As this is the date that John's ContactStatus last changed (i.e. it changed from "Alive" to "Dead").

Ultimately, I want to put this query into a function. A function that can be called by passing in the Contact_UID and the name of the field I want to check. This function could then be called as part of a more general query. For example:

SELECT Name, MyFunction('62918AC1-1C6C-4DEB-B7F8-5D5EF913F667', 'ContactStatus') AS StatusLastChanged FROM Contact

What I have tried so far

Well, I have tried a few things, and although I can get the results I want. My attempts are really struggling with performance problems.

NOTE: Although I only really want a single datetimeoffset result. Some attempts include more data/fields simply to try and validate the data is accurate.

Attempt 1:

SELECT TOP(1) a.LogDate
FROM Contact_Log AS a
WHERE a.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667' 
AND a.ContactStatus <>
(
SELECT TOP(1) b.ContactStatus
FROM Contact_Log AS b
WHERE b.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
AND a.LogDate > b.LogDate
ORDER BY b.LogDate DESC
)
ORDER BY LogDate DESC

Problem 1: Too slow. I had to stop the query after nearly an hour of waiting with no results.

Attempt 2:

SELECT A.LogDate
FROM (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) A
LEFT JOIN (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) B
ON A.rnum = B.rnum-1
WHERE 
(B.rnum IS NULL
OR (A.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667' 
AND B.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
AND A.ContactStatus != B.ContactStatus))
ORDER BY A.rnum

Problem 2: This works and gives me the correct data set. However it takes 6 seconds which is too slow. Remember it will need to work as a function in a more general query (with ~10,000 rows).

Attempt 3: Now this is basically the same as attempt 2, expect I tried to apply TOP(1) so that I can get the result that I actually want.

SELECT TOP(1) A.LogDate
FROM (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) A
LEFT JOIN (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) B
ON A.rnum = B.rnum-1
WHERE 
(B.rnum IS NULL
OR (A.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667' 
AND B.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
AND A.ContactStatus != B.ContactStatus))
ORDER BY A.rnum

Problem 3: To my surprise, this takes a lot longer than attempt 2, even though all I did was add TOP(1) at the start. This took over 5 minutes so I stopped the query and gave up.

Question

How can I do what I want in "What I am trying to achieve", but with a reasonable amount of performance? (I would be happy getting it under 1 second at this stage).

Remember, I just want a single datetimeoffset as the result so it can be used in a function.

So far I have no specific indexes created. I am happy to consider suggestions for those as a suitable answer, if no improvement to the query are possible. Or any changes to the schema that is appropriate.

Bottom Line

I am looking for a query that will produce 1 result, with 1 datetimeoffset field. It needs to take less than 1 second to run.

Salman A
  • 262,204
  • 82
  • 430
  • 521
musefan
  • 47,875
  • 21
  • 135
  • 185
  • What is `MyFunction`? Are you therefore using SQL Server 2019? – Thom A Nov 28 '19 at 14:41
  • `MyFunction` will be the function that I eventually create that returns the result that I am trying to get. I am using SQL Server 2008 R2 for testing. – musefan Nov 28 '19 at 14:43
  • Does it have to be for exactly one id? Or do you want dates for several records in one query? – Salman A Nov 28 '19 at 14:44
  • @SalmanA: The function with take a Contact_UID (uniqueidentifier) and return a single datetimeoffset result. – musefan Nov 28 '19 at 14:46
  • What index definitons are on the tables? – Adwaenyth Nov 28 '19 at 14:49
  • Scalar User Defined functions are notoriously slow, @musefan (unless using inlining in 2019). Also, 2008 is completely unsupported and doesn't (if I recall correctly) support TVfs. I would get rid of the function to start, imo. – Thom A Nov 28 '19 at 14:49
  • @Adwaenyth: None other than PK, as stated in the question. – musefan Nov 28 '19 at 14:50
  • @Larnu: Thanks for the advice, I will look into it. It is possible that we can enforce 2012 or later as requirements. For my question though I would like to focus on just getting a query that outputs a single datetimeoffset. The function is more a long term goal that I thought worth mentioning... that problem can come later down the line. I just need a good query first. – musefan Nov 28 '19 at 14:52
  • @Larnu: If you have any suggestions for an alternate approach to the problem then please comment and I will look into it. Thanks. – musefan Nov 28 '19 at 14:55
  • Is it possible to have two changes on same date? – Salman A Nov 28 '19 at 15:07
  • @SalmanA: The data type is datetimeoffset(7) so will have millisecond precision. Although technically possible to have two changes at the same moment. It is fine to assume it won't happen for the purposes of this question – musefan Nov 28 '19 at 15:10

2 Answers2

1

You want to select the smallest date right after the highest date that does not equal the current ContactStatus. That would be something like this:

select
min(LogDate)
from Contact_Log
where
Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
and ContactStatus = (
        select top 1
        ContactStatus
        from Contact_Log where
        Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
        order by Log_Date desc
        )
and LogDate > (
    select max(LogDate)
    from Contact_Log
    where Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
    and ContactStatus != (
        select top 1
        ContactStatus
        from Contact_Log where
        Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
        order by Log_Date desc
        )
    );
verhie
  • 1,298
  • 1
  • 7
  • 7
  • I need to try and achieve this without using the Contact table, if possible. – musefan Nov 28 '19 at 15:11
  • I also won't have anyway to know what the current value is. It should be determined from the Contact_Log table what the current value is. Having said that I tested your query and it is fast. I just won't know the current status to use in the query – musefan Nov 28 '19 at 15:14
  • sorry, sql-server is not my default SQL – verhie Nov 28 '19 at 15:30
  • Thanks. It seems to work, as does the other answer. I will do more testing and award correct answer to whichever is fastest. – musefan Nov 28 '19 at 15:46
1

This is based on your original attempt:

SELECT ca3.LogDate
FROM (
    -- find last status
    SELECT TOP 1 *
    FROM Contact_Log
    WHERE Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
    ORDER BY LogDate DESC, ID DESC
) AS ca1
CROSS APPLY (
    -- find date when status changed
    SELECT TOP 1 *
    FROM Contact_Log
    WHERE Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
    AND ContactStatus <> ca1.ContactStatus
    ORDER BY LogDate DESC, ID DESC
) AS ca2
CROSS APPLY (
    -- find next date
    SELECT TOP 1 *
    FROM Contact_Log
    WHERE Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
    AND (LogDate = ca2.LogDate AND ID > ca2.ID
         OR LogDate > ca2.LogDate)
    ORDER BY LogDate, ID
) AS ca3

This query should benefit from an index consisting of Contact_UID, LogDate, ID, Status

Salman A
  • 262,204
  • 82
  • 430
  • 521
  • Thanks. I seem to be getting the wrong result, but it's strangely close. As you predicted my test data does have duplicate dates. I tried adding the ID in but I get the same results. Could you edit answer to include solution that works for duplicate dates too. That way I can check I am testing correctly as it's probably just me – musefan Nov 28 '19 at 15:33
  • Don't worry. I edited test data do avoid duplicate dates.and your solution works great. FYI I think the problem is when trying to "find next date" with duplicate dates as the `LogDate > ca2.LogDate` part can end up grabbing values from before the change too (as they have same time). In practice it's not going to happen though so all is good. – musefan Nov 28 '19 at 15:43
  • Other answer seems to be faster though. I will upvote both answers now as correct, and will award the fastest one after I have done some more thorough testing. Let me know if you can speed up your one ;) – musefan Nov 28 '19 at 15:45
  • See revised answer, it is from top of my head and I did not test for edge cases. Index should help with speed, this and the other query. – Salman A Nov 28 '19 at 17:20