Performance issue on selecting n newest rows in subselect

Question

I have a database with courses. Each course contains a set of nodes, and some nodes contains a set of answers from students. The Answer table looks (simplified) like this:

Answer

id  | courseId |  nodeId |  answer
------------------------------------------------
 1  |   1      |   1     |  <- text ->
 2  |   2      |   2     |  <- text ->
 3  |   1      |   1     |  <- text ->
 4  |   1      |   3     |  <- text ->
 5  |   2      |   2     |  <- text ->
..  |  ..      |   ..    |  ..

When a teacher opens a course (i.e. courseId = 1) I want to pick the node that have received the most answers lately. I can do this using the following query:

with Answers as
(
   select top 50 id, nodeId from Answer A where courseId=1 order by id desc
)
select top 1 nodeId from Answers group by nodeId order by count(id) desc

or equally using this query:

select top 1 nodeId from 
    (select top 50 id, nodeId from Answer A where courseId=1 order by id desc)
    group by nodeId order by count(id) desc

In both querys the newest 50 answers (with the highest ids) are selected and then grouped by nodeId so I can pick the one with the highest frequency. My problem is, however, that the query is very slow. If I only run the subselect, it takes less than a second, and grouping 50 rows should be fast, but when I run the entire query it takes about 10 seconds! My guess is that sql server does the select and grouping first, and afterwards does the top 50 and top 1, which in this case leads to terrible performance.

So, how can I rewrite the query to be efficient?

What indexes do you have on that table? I'd recommend having one compound index on `(course_id, id)`. As for your assumption about what sql server is doing, test that empirically by getting the execution plans and comparing them. You may find that the existence of the outer query changes the plan for the inner part of the query. You may also find that adding the index mentioned changes the plan completely. — MatBailie, Aug 19 '15 at 11:24

score 2 · Accepted Answer · answered Aug 19 '15 at 11:11

2

You can add indexes to make your queries more efficient. For this query:

with Answers as (
      select top 50 id, nodeId
      from Answer A
      where courseId = 1
      order by id desc
     )
select top 1 nodeId
from Answers
group by nodeId
order by count(id) desc;

The best index is Answer(courseId, id, nodeid).

answered Aug 19 '15 at 11:11

Gordon Linoff

1,242,037
58
646
786

Yes of course, but that will actually not answer my question. Since the inner query takes less than a second, and the entire query multiplies that time by more than 10, the query obviously leads to a lot of overhead under the hood, and as far as I know indexes won't fix that. Moreover we want to use as few indexes as possible since the table is written to all the time. – Bjørn Ove Thue Aug 19 '15 at 11:24
1

@BjørnOveThue - you probably should test your assumptions before asserting your conclusions. SQL Server doesn't generate an execution plan for the inner query independently of the outer query, nor does it generate execution plans independently of the indexes. Put the other way; adding the outer query can change the execution plan, and adding the indexes can change the execution plan again. I'd suggest trying this solution, including the index, and analysing the execution plans that you get; you may yet find that this works for you. – MatBailie Aug 19 '15 at 11:29
OK! Adding the index seemed to change the execution plan as well. Problem solved! :) – Bjørn Ove Thue Aug 19 '15 at 12:52

score 0 · Answer 2 · answered Aug 19 '15 at 11:52

To be more insightful we'd need to see the indexes on that table and the execution plans you're getting (one plan for the inner query on it's own, one plan for the full query).

I'd even recommend doing the same analysis again having added the index mentioned elsewhere on this page.

Without that information the only things we can recommend are trial and error.

For example, try avoiding using TOP (this shouldn't matter, but we're guessing while we can't see your indexes and execution plans)

WITH
    Answers AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY id DESC)   AS rowId,
        id,
        nodeId
    FROM
        Answer
    WHERE
        courseId = 1
),
    top50 AS
(
    SELECT
        nodeId,
        COUNT(*)   AS row_count
    FROM
        Answers
    WHERE
        rowId <= 50
    GROUP BY
        nodeId
),
    ranked AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY row_count DESC, nodeId DESC)  AS ordinal,
        nodeID
    FROM
        top50
)
SELECT
    nodeID
FROM
    ranked
WHERE
    oridinal = 1

Which is massively over the top, but functionally the same as you have in your OP, but sufficiently different to potentially get a different execution plan.

Alternatively (and not very nice), just put the results of your inner query in to a table variable, then run the outer query on the table variable.

I still expect, however, that adding the index will be the least-worst option.

OK! Adding the index seemed to change the execution plan as well. Problem solved! :) — Bjørn Ove Thue, Aug 19 '15 at 12:52

Performance issue on selecting n newest rows in subselect

2 Answers2