3

My database teacher asked me to write (on Oracle Server) a query: select the groupid with the highest score average for year 2010

I wrote:

SELECT * FROM (
    SELECT groupid, AVG(score) average FROM points
    WHERE yr = 2010
    AND score IS NOT NULL
    GROUP BY groupid
    ORDER BY average DESC
) WHERE rownum = 1;

My teacher tells me that this request is "better":

SELECT groupid, AVG(score) average FROM points
WHERE yr = 2010
GROUP BY groupid
HAVING AVG(score) >= ALL (
    SELECT AVG(score) FROM points
    WHERE yr = 2010
    GROUP BY groupid
);

Which one is the fastest/better ? Is there any better solution too (for Oracle only) ? Thanks.

Maxux
  • 199
  • 1
  • 1
  • 11

3 Answers3

3

There are two reasons your instructor is telling you that.

  1. Data model. Relational DBMSs deal with sets, not lists. If you are learning SQL, it is better for you to think in terms of sets of tuples, that are unordered, than in order lists. You will be better at understanding how to query the DBMS. I consider your solution a hack: one that works, partially, since--as Perun_x has pointed out--- it does not work if more than one tuple match the result. It is contrary to the data model and spirit of SQL).

  2. Portability. This is the real killer. Your code will work on Oracle but not in other DBMSs that do not support the row_number attribute (each has its own way to do it).

--dmg

dmg
  • 4,231
  • 1
  • 18
  • 24
  • Thanks. Indeed, I don't have "set of tuples" point of view. But the educationnal way next to the performance way is not for me, really interresting. For the portability point, it's totally true. I would prefere use LIMIT inside of rownum, wich is not portable too, I know. In the other side, we are learning Oracle specific code too... – Maxux Apr 28 '13 at 18:51
2

The queries aren't equivalent. The 1st query always selects 1 row. The second one selects all rows with the highest average (there can theoretically be more such lines).

Perun_x
  • 162
  • 2
  • 10
2

I happen to prefer your version, assuming that one row is sufficient for what you need. My problem with the teacher's version is primarily readability. I find it hard to parse.

Your version is essentially saying "Order the groups by their average and take the one with the highest average." The teachers version is essentially saying: "Find the average that is greater than or equal to any of the averages of the groups". This may be subjective, but I find the former more understandable than the latter.

As for which is faster. Yours needs to do an aggregation and a sort to get the best value. The second version needs to do two aggregations and a join. I would think that the aggregation/sort method would be faster, but the only way to really know is by checking the performance on a particular system configuration and data set.

An alternative formulation, that should be roughly equivalent to yours in terms of performance, is:

select groupid, avgscore
from (select groupid, avg(score) as avgscore,
             row_number() over (order by avg(score) desc) as seqnum
     from points
     where yr = 2010
     group by groupid
    ) t
where seqnum = 1

The advantage here is that you can change the row_number() to dense_rank() to get either 1 of the best rows or all of the best rows.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks. I tried some method (tried a WITH statement too), and yes, the HAVING method is dramatically slower. Your alternative formulation is interresting. – Maxux Apr 28 '13 at 18:45