
This query (simplified here) is part of a larger query and is joined on date with other selects, but I have pinned this section down as the slow one. Say I have a UserLoginHistory table that logs each login for a user. For each user, I want the date they first logged in. (Later in the query, I group by LogDate to get how many first-time logins there were each day.)

select
    LogDate, --(this value is only date, no time)
    UserId
from
    UserLoginHistory ul
where
    not exists
        (
            select 
                * 
            from 
                UserLoginHistory ulPrevious
            where
                ulPrevious.LogDate < ul.LogDate
                and ul.UserId = ulPrevious.UserId
        )
group by ul.LogDate, ul.UserId

Obviously the NOT EXISTS part is the slow one, but I can't figure out how to replace it with something more efficient that does the same job.

With a small UserLoginHistory row count, performance is no problem. It's when I get to around 15,000 rows that it starts getting slow. Maybe I should batch the result for each day into another table, but I'd rather find a better solution to this query, as there should be one out there...

Thanks for your time!

cederlof
  • By definition, NOT EXISTS must perform a table scan. The key to tuning its performance is to make the relation that must be scanned as small as possible, ideally a non-clustered index. Without knowing what indices exist on your table, it is not possible to give more specific advice. – Pieter Geerkens Mar 07 '13 at 15:52
  • Under most circumstances this is the best method between `not in` and `left outer join where key is null`: [Aaron Bertrand has tested it](http://www.sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join). – Nikola Markovinović Mar 07 '13 at 15:53
  • Performance questions tend to be very platform-specific. Which database are you using: SQL Server or Sybase? And what indexes do you have on the table? 15,000 rows isn't very much, so it seems likely that your indexing may not be optimal. – Pondlife Mar 07 '13 at 16:56
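To make the indexing comments concrete: a composite index on (UserId, LogDate) lets the engine answer each correlated NOT EXISTS probe with an index seek instead of a scan. A minimal SQLite sketch of the idea (the sample data and index name are made up; the `create index` syntax is the same on SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table UserLoginHistory (UserId integer, LogDate text);
insert into UserLoginHistory values
    (1, '2013-03-01'), (1, '2013-03-02'),
    (2, '2013-03-01'), (2, '2013-03-03'),
    (3, '2013-03-02');
-- composite index so each NOT EXISTS probe becomes an index seek
create index IX_UserLoginHistory_UserId_LogDate
    on UserLoginHistory (UserId, LogDate);
""")

# original query shape: keep rows with no earlier login for the same user
first_logins = conn.execute("""
    select LogDate, UserId
    from UserLoginHistory ul
    where not exists (
        select *
        from UserLoginHistory ulPrevious
        where ulPrevious.LogDate < ul.LogDate
          and ulPrevious.UserId = ul.UserId
    )
""").fetchall()

print(sorted(first_logins))
# [('2013-03-01', 1), ('2013-03-01', 2), ('2013-03-02', 3)]
```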

2 Answers

4

You can use a row numbering method:

select LogDate, UserId from (
    select
        LogDate,
        UserId,
        row_number() over (partition by UserId order by LogDate) as rown
    from
        UserLoginHistory ul
) numbered
where rown = 1

Rows for each ID are numbered by LogDate, so the earliest one will always be numbered 1.

Note: I don't think the `group by` in your original query was necessary; the `not exists` clause should guarantee that you only get unique combinations of UserId and LogDate.
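As a quick check of the row-numbering approach, here is a SQLite sketch with made-up sample data (window functions need SQLite 3.25+, which recent Python builds bundle):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table UserLoginHistory (UserId integer, LogDate text);
insert into UserLoginHistory values
    (1, '2013-03-01'), (1, '2013-03-02'),
    (2, '2013-03-01'), (2, '2013-03-03'),
    (3, '2013-03-02');
""")

# number each user's logins by date; row 1 is that user's first login
first_logins = conn.execute("""
    select LogDate, UserId from (
        select
            LogDate,
            UserId,
            row_number() over (partition by UserId order by LogDate) as rown
        from UserLoginHistory
    ) numbered
    where rown = 1
""").fetchall()

print(sorted(first_logins))
# [('2013-03-01', 1), ('2013-03-01', 2), ('2013-03-02', 3)]
```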

4

If these are the only 2 fields you are interested in could you not use a simple aggregate?

SELECT  LogDate = MIN(LogDate),
        UserID
FROM    UserLoginHistory
GROUP BY UserID;
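For instance, a SQLite sketch with made-up sample data (note the `LogDate = MIN(LogDate)` alias form above is T-SQL-specific, so standard `as` is used here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table UserLoginHistory (UserId integer, LogDate text);
insert into UserLoginHistory values
    (1, '2013-03-01'), (1, '2013-03-02'),
    (2, '2013-03-01'), (2, '2013-03-03'),
    (3, '2013-03-02');
""")

# one row per user with their earliest LogDate
first_logins = conn.execute("""
    select min(LogDate) as LogDate, UserId
    from UserLoginHistory
    group by UserId
""").fetchall()

print(sorted(first_logins))
# [('2013-03-01', 1), ('2013-03-01', 2), ('2013-03-02', 3)]
```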
GarethD
  • Thanks, that would probably work for a simple scenario, but I chose dan1111's answer, as it is more flexible for me. – cederlof Mar 08 '13 at 08:02