
This query (simplified here) is part of a larger query and is joined on date with other selects, but I have pinned this section down as the slow one. Say I have a UserLoginHistory table that logs each login for a user. For each user, I want the date they first logged in. (Later in the query, I group by LogDate to get how many first-time logins there were each day.)

select
    LogDate, --(this value is only date, no time)
    UserId
from
    UserLoginHistory ul
where
    not exists
        (
            select 
                * 
            from 
                UserLoginHistory ulPrevious
            where
                ulPrevious.LogDate < ul.LogDate
                and ul.UserId = ulPrevious.UserId
        )
group by ul.LogDate, ul.UserId

Obviously the NOT EXISTS part is the slow one, but I can't figure out how to replace it with something more efficient that does the same job.

With a small UserLoginHistory row count, performance is no problem. It's when I get to around 15,000 rows that it starts getting slow. Maybe I should batch the result for each day into another table, but I'd rather find a better solution to this query, as there should be one out there...

Thanks for your time!

cederlof
  • By definition, NOT EXISTS must perform a table scan. The key to tuning its performance is to make the relation that must be scanned as small as possible, ideally a non-clustered index. Without knowing what indices exist on your table, it is not possible to give more specific advice. – Pieter Geerkens Mar 07 '13 at 15:52
  • Under most circumstances this is the best method between `not in` and `left outer join where key is null`: [Aaron Bertrand has tested it](http://www.sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join). – Nikola Markovinović Mar 07 '13 at 15:53
  • Performance questions tend to be very platform-specific. Which database are you using: SQL Server or Sybase? And what indexes do you have on the table? 15,000 rows isn't very much, so it seems likely that your indexing may not be optimal. – Pondlife Mar 07 '13 at 16:56
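To make the indexing comments concrete: a composite index on (UserId, LogDate) lets the engine answer each correlated NOT EXISTS probe with an index seek instead of a scan. A minimal SQLite sketch of the idea (the sample data and index name are made up; the `create index` syntax is the same on SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table UserLoginHistory (UserId integer, LogDate text);
insert into UserLoginHistory values
    (1, '2013-03-01'), (1, '2013-03-02'),
    (2, '2013-03-01'), (2, '2013-03-03'),
    (3, '2013-03-02');
-- composite index so each NOT EXISTS probe becomes an index seek
create index IX_UserLoginHistory_UserId_LogDate
    on UserLoginHistory (UserId, LogDate);
""")

# original query shape: keep rows with no earlier login for the same user
first_logins = conn.execute("""
    select LogDate, UserId
    from UserLoginHistory ul
    where not exists (
        select *
        from UserLoginHistory ulPrevious
        where ulPrevious.LogDate < ul.LogDate
          and ulPrevious.UserId = ul.UserId
    )
""").fetchall()

print(sorted(first_logins))
# [('2013-03-01', 1), ('2013-03-01', 2), ('2013-03-02', 3)]
```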

2 Answers

4

You can use a row numbering method:

select LogDate, UserId from (
    select
        LogDate,
        UserId,
        row_number() over (partition by UserId order by LogDate) as rown
    from
        UserLoginHistory ul
) numbered
where rown = 1

Rows for each ID are numbered by LogDate, so the earliest one will always be numbered 1.

Note: I don't think the `group by` in your original query was necessary; the `not exists` clause should guarantee that you only get unique combinations of UserId and LogDate.
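As a quick check of the row-numbering approach, here is a SQLite sketch with made-up sample data (window functions need SQLite 3.25+, which recent Python builds bundle):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table UserLoginHistory (UserId integer, LogDate text);
insert into UserLoginHistory values
    (1, '2013-03-01'), (1, '2013-03-02'),
    (2, '2013-03-01'), (2, '2013-03-03'),
    (3, '2013-03-02');
""")

# number each user's logins by date; row 1 is that user's first login
first_logins = conn.execute("""
    select LogDate, UserId from (
        select
            LogDate,
            UserId,
            row_number() over (partition by UserId order by LogDate) as rown
        from UserLoginHistory
    ) numbered
    where rown = 1
""").fetchall()

print(sorted(first_logins))
# [('2013-03-01', 1), ('2013-03-01', 2), ('2013-03-02', 3)]
```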

4

If these are the only 2 fields you are interested in could you not use a simple aggregate?

SELECT  LogDate = MIN(LogDate),
        UserID
FROM    UserLoginHistory
GROUP BY UserID;
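For instance, a SQLite sketch with made-up sample data (note the `LogDate = MIN(LogDate)` alias form above is T-SQL-specific, so standard `as` is used here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table UserLoginHistory (UserId integer, LogDate text);
insert into UserLoginHistory values
    (1, '2013-03-01'), (1, '2013-03-02'),
    (2, '2013-03-01'), (2, '2013-03-03'),
    (3, '2013-03-02');
""")

# one row per user with their earliest LogDate
first_logins = conn.execute("""
    select min(LogDate) as LogDate, UserId
    from UserLoginHistory
    group by UserId
""").fetchall()

print(sorted(first_logins))
# [('2013-03-01', 1), ('2013-03-01', 2), ('2013-03-02', 3)]
```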
GarethD
  • Thanks, that would probably work for a simple scenario, but I chose dan1111's answer, as it is more flexible for me. – cederlof Mar 08 '13 at 08:02