
I've got a very large (100,000,000+ rows), poorly performing table in the following format

TABLE [dbo].[Emails](
    [pkid] [int] IDENTITY(1,1) NOT NULL,
    [cid] [int] NULL,
    [email] [nvarchar](255) NULL,
    [lid] [int] NULL,
    [date] [smalldatetime] NULL
)

[There's a clustered index on pkid, a Unique non-clustered index on cid+email+lid, and a non-unique non-clustered index on cid alone]

For a given cid, there can be several entries with the same email, and different lid values.

cid, email,             lid, date
123, mal@serenity.fake, 456, 2014-07-17 12:21:00
123, mal@serenity.fake, 459, 2014-07-17 12:26:00
123, mal@serenity.fake, 466, 2014-07-17 12:27:00
123, zoe@serenity.fake, 456, 2014-07-17 12:21:00
123, zoe@serenity.fake, 467, 2014-07-17 12:28:00

For a given cid, I want to extract the first 10 unique emails and concatenate the rest of the data into a list of lids and either a list of dates or a min-date field. I would also like to return the pkid of the last data row selected, for paging.

So, for example:

mal@serenity.fake, [list of lids], [list of dates], [last pkid for all rows]
zoe@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
wash@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
jane@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
kaylee@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
inara@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
book@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
simon@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
river@serenity.fake, [list of lids], [list of dates], [last pkid for all rows] 
early@intruder.fake, [list of lids], [list of dates], [last pkid for all rows] 

What is the most efficient way of doing this? The way this is currently implemented seems horribly inefficient and I'd like to improve it:

SELECT TOP (@rows) * FROM (
    SELECT TOP (@rows) * FROM (
        SELECT TOP (@rows * @page)
            email,
            MIN([date]) as [date],
            COUNT(lid) as [lids],
            [ids] = STUFF((
                SELECT ',' + CAST(lid AS VARCHAR(8)) + ';' + CAST([date] as VARCHAR(20))
                FROM Emails
                WHERE cid = @cid AND email = Z.email
                FOR XML PATH('')
            ), 1, 1, '')
        FROM Emails Z (NOLOCK)
        WHERE cid = @cid
        GROUP BY email
        ORDER BY MIN([date]) DESC
    ) as [a] ORDER BY [date]
) [b] ORDER BY [date] DESC

As you can see, if I'm trying to get to page 5 of these results, it's selecting all 5 pages (50 emails, which could be 500+ rows of data), and then the 10 from the bottom. There has to be a better way to do this!
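For what it's worth, on SQL Server 2012 or later the nested-TOP pattern itself can be replaced with OFFSET/FETCH. This is only a sketch (it assumes the same @cid, @rows, and 1-based @page parameters, and leaves the lid/date concatenation out for brevity); it does not remove the cost of reading the earlier groups, but it avoids re-sorting the result three times:

```sql
-- Paging via OFFSET/FETCH (SQL Server 2012+); @page is assumed 1-based
SELECT email,
       MIN([date]) AS [date],
       COUNT(lid)  AS [lids]
FROM Emails WITH (NOLOCK)
WHERE cid = @cid
GROUP BY email
ORDER BY MIN([date]) DESC
OFFSET (@rows * (@page - 1)) ROWS   -- skip the earlier pages
FETCH NEXT @rows ROWS ONLY;         -- return one page of emails
```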

Jesuraja
roryok
    Looks like you might need to use `ROW_NUMBER()` with an `OVER(PARTITION BY ...)`: http://stackoverflow.com/questions/16303343/t-sql-to-find-top-10-records-for-each-group/16303975#16303975 – valverij Jul 17 '14 at 12:12
  • You could try to include the date column in your unique index (as an include column) to avoid a key lookup if that index is to be used. You could also try to move the generation of ids column to the main query so you are only generating for the 10 emails you are actually returning. – Mikael Eriksson Jul 17 '14 at 12:55
    Are you aware of all the challenges NOLOCK brings to the table? Is duplicate and/or missing rows ok? If you are going to use it you need to change your syntax and use the WITH keyword. Table hints without the WITH keyword are deprecated and will no longer work at some point. – Sean Lange Jul 17 '14 at 13:23
  • @MikaelEriksson updating the indexes is not an option I'm afraid. – roryok Jul 17 '14 at 13:25
  • @SeanLange I'm aware of the missing rows issues with NOLOCK, but the speed benefits far outweigh the occasional missing result. A select on a given cid takes 1 second using NOLOCK and times out without it – roryok Jul 17 '14 at 13:26
    Understand about the speed, especially on a table this large. I would recommend setting your isolation level in your procedure instead of littering that hint all over the place. That way if you decide to change at some point you just remove the one line and you don't have to add the hint to every table (which btw you don't have this hint in your xml query). SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED – Sean Lange Jul 17 '14 at 13:38
  • Actually, forget what I said before. All options are on the table. If I can get a decent performance boost by changing indexes then I'll do it. The only issue is that it takes hours and hours and altering indexes sometimes times out on me – roryok Jul 17 '14 at 13:38
  • I'm seeing the big picture now. In order for this query to work it really has to do a scan each time. I think I'll just have to put up with poor performance. Thanks guys. – roryok Jul 24 '14 at 09:59
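Mikael Eriksson's suggestion in the comments, adding date as an included column to the existing unique index to avoid the key lookup, would look roughly like this. The index name is illustrative, and note that rebuilding an index on a 100,000,000+ row table is slow; ONLINE = ON (Enterprise Edition only) may help avoid blocking:

```sql
-- Hypothetical rebuild of the existing unique index with [date] as an INCLUDE column
CREATE UNIQUE NONCLUSTERED INDEX UX_Emails_cid_email_lid
ON dbo.Emails (cid, email, lid)
INCLUDE ([date])
WITH (DROP_EXISTING = ON, ONLINE = ON);
```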

1 Answer


What about something like this? I didn't put together a million row test table for this but the execution plan is a LOT simpler. Building the CSV is going to hurt because it is going to force a second table scan no matter what you do.

with MyRows as
(
    SELECT TOP (@rows * @page) 
        email,
        MIN([date]) as [date],
        COUNT(lid) as [lids],
        [ids] = STUFF(
            (
                SELECT ',' + CAST(lid AS VARCHAR(8)) + ';' + CAST([date] as VARCHAR(20))
                FROM Emails WITH (NOLOCK)
                WHERE cid = @cid 
                AND email = Z.email 
                FOR XML PATH('')
            ), 1, 1, ''),
        -- number the email groups across the whole result set;
        -- PARTITION BY email would make RowNum 1 for every group after the GROUP BY
        ROW_NUMBER() over(order by MIN([date]) desc) as RowNum
    FROM Emails Z WITH (NOLOCK) 
    WHERE cid = @cid 
    GROUP BY email 
    ORDER by MIN([date]) DESC 
)

select email
    , [date]
    , lids
    , ids
from MyRows
where RowNum > @rows * (@page - 1)  -- TOP above already caps at @rows * @page
Sean Lange