Optimizing slow performing greatest-n-per-group query

Question

I have the following table:

id | query   | update_date | website_id | device  | page    | impressions | clicks | position | is_brand
---+---------+-------------+------------+---------+---------+-------------+--------+----------+---------
1  | kitchen | 2018-05-01  | 2          | desktop | http... | 11000       | 50     | 3        | 1
2  | table   | 2018-05-01  | 2          | desktop | http... | 7000        | 40     | 3        | 0
3  | kitchen | 2018-05-02  | 2          | desktop | http... | 11500       | 55     | 3        | 1
4  | table   | 2018-05-02  | 2          | desktop | http... | 7100        | 35     | 3        | 0

In this table I need a procedure that for each unique query gives me the best performing row in regards to clicks for a given time period. This resulted in the following procedure:

create or alter procedure get_best_website_querys
    @from as date,
    @to as date,
    @website_id as int
as
begin
    WITH    cte
          AS (SELECT    *
              ,         ROW_NUMBER() OVER (PARTITION BY query ORDER BY clicks DESC) RN
              FROM      search_console_query
              where 
                update_date >= @from and 
                update_date <= @to and 
                website_id = @website_id 
             )
    SELECT  cte.id
     ,      cte.query
     ,      cte.update_date
     ,      cte.website_id
     ,      cte.device
     ,      cte.page
     ,      cte.impressions
     ,      cte.clicks
     ,      cte.POSITION
     ,      cte.is_brand
    FROM    cte
    WHERE   RN = 1
end;

Now, this works and gives me the correct result. My problem is that this table grows quite large and this query performs rather slowly (> 3 minutes for a year). The query gives the following execution plan:

On the table I have a non-clustered index on clicks and a clustered one on (website_id, update_date).

I would like some input in regards to what would be the best approach to getting this to perform better. Any input would be appreciated.

Have you tried adding an index on `query, clicks` (in that order)? — TT., Jul 04 '18 at 11:16
I would try creating a non clustered index on `query` and another one on `clicks`. — Emilio Lucas Ceroleni, Jul 04 '18 at 11:20

Gordon Linoff · Answer 1 · 2018-07-04T17:11:07.777

First, try adding an index on search_console_query scq(website_id, update_date, query, clicks).

Then would suggest trying this version:

select scq.*
from search_console_query scq
where scq.update_date >= @from and 
      scq.update_date <= @to and 
      scq.website_id = @website_id and
      scq.clicks = (select max(scq2.clicks)
                    from search_console_query scq2
                    where scq2.website_id = scq.website_id and
                          scq2.query = scq.query and
                          scq2.update_date >= @from and
                          scq2.update_date <= @to
                    );

This version can take advantage of two indexes: search_console_query(website_id, query, update_date, clicks) and search_console_query(website_id, update_date, query, clicks).

This is slightly different, because it will return multiple rows for a query in the event of ties. If the performance improves significantly -- and this is an issue -- then that can be fixed.

EDIT:

The easiest way to remove duplicates for the second version is to assume that the table has a unique id column:

select scq.*
from search_console_query scq
where scq.update_date >= @from and 
      scq.update_date <= @to and 
      scq.website_id = @website_id and
      scq.sqc_id = (select top (1) sqc2.id
                    from search_console_query scq2
                    where scq2.website_id = scq.website_id and
                          scq2.query = scq.query and
                          scq2.update_date >= @from and
                          scq2.update_date <= @to
                    order by scq2.clicks desc);

Thank you for your input. I've made a clustered index ``(website_id, update_date, query, clicks)`` instead of the original one, as well as one on ``query`` and one one ``(query, clicks)``.. Your query however outperforms the original one by a lot. There is many duplicate rows however. Is this something that can be fixed without sacrificing too much performance? — Frederik Hansen, Jul 04 '18 at 12:08
@FrederikHansen see if your original query is able to use this index. — Salman A, Jul 04 '18 at 12:10
@SalmanA The execution plan says that it is using this index, but the performance still isn't very good. — Frederik Hansen, Jul 04 '18 at 12:13
@FrederikHansen . . . See the edit using `order by` and `top`. — Gordon Linoff, Jul 04 '18 at 17:11

score 2 · Answer 2 · answered Jul 04 '18 at 11:35

I would recommend using the indexes suggested above. Secondly, parameter sniffing could also be happening here. I would suggest that you re-declare the variables inside the stored procedure as follows so that parameter sniffing doesn't happen:

create or alter procedure get_best_website_querys    
    @from as date,
    @to as date,
    @website_id as int
as
begin
DECLARE @StartDate AS DATE = @from
       ,@EndDate AS DATE = @to
       ,@WebsiteID AS INT = @website_id

      WITH    cte
      AS (SELECT    *
          ,         ROW_NUMBER() OVER (PARTITION BY query ORDER BY clicks DESC) RN
          FROM      search_console_query
          where 
            update_date >= @StartDate and 
            update_date <= @EndDate and 
            website_id = @WebsiteID
         )
SELECT  cte.id
 ,      cte.query
 ,      cte.update_date
 ,      cte.website_id
 ,      cte.device
 ,      cte.page
 ,      cte.impressions
 ,      cte.clicks
 ,      cte.POSITION
 ,      cte.is_brand
FROM    cte
WHERE   RN = 1
end;

Salman A · Answer 3 · 2018-07-04T12:32:16.553

1

Seems like all of the columns in the select clause are indexable; you can try creating a massive covering index using included columns:

CREATE INDEX TEST_0001 ON search_console_query (
    website_id,
    update_date,
    query,
    clicks
) INCLUDE (
    id,
    device,
    page,
    impressions,
    position,
    is_brand
)

While you are at it, try more variations below and see which one is chosen by SQL server, then delete unnecessary ones:

changing the order of website_id, update_date
moving query, clicks inside included columns

edited Jul 04 '18 at 12:32

answered Jul 04 '18 at 12:19

Salman A

262,204
82
430
521

So I made the 3 indexes and it seems to prefer the last one ( with ``query, clicks`` inside include). Should I alter my clustered index in regards to this information? Also, my performance didn't increase by much. As I see it my only option is to upgrade my database. Is this a correct assumption? – Frederik Hansen Jul 04 '18 at 13:13
I would not alter the clustered index. Also SQL server should not hit the table and .fetch the results directly from the index (the execution plan should will tell). That I think should be as good as it gets. – Salman A Jul 04 '18 at 13:22

score 0 · Answer 4 · answered Jul 04 '18 at 12:20

An Alternative way to do this. But not sure about the performance, typically this pattern is used to find the latest record with non grouped columns in select list.

SELECT          a.id,
                a.query,
                a.update_date,
                a.website_id,
                a.device,
                a.page,
                a.impressions,
                a.clicks,
                a.POSITION,
                a.is_brand
FROM            search_console_query a
LEFT JOIN       search_console_query b  ON b.website_id = a.website_id
                                       AND a.query = b.query
                                       AND a.clicks > b.clicks
WHERE           update_date >= @from
                AND update_date <= @to
                AND website_id = @website_id
                AND b.clicks IS NULL

Optimizing slow performing greatest-n-per-group query

4 Answers4