
I need to write a SQL statement that selects, for each day, the row with the minimum value of the Order column; when two rows share the same Order value, it should take the one with the maximum Percentage.

Example:

Date           Order            Percentage           
-------------------------------------------
01-03-2016      1                   0 
01-03-2016      2                   20
02-03-2016      1                   0
02-03-2016      2                   20
03-03-2016      2                   50
03-03-2016      2                   20

The result that I want is something like:

Date           Order            Percentage           
-------------------------------------------
01-03-2016      1                   0 
02-03-2016      1                   0
03-03-2016      2                   50

2 Answers


You could use ROW_NUMBER() to rank the rows within each date, ordering by [Order] and breaking ties with the highest [Percentage], and then take the first row per group:

SELECT [Date], [Order], [Percentage]
FROM   (SELECT [Date], [Order], [Percentage],
               ROW_NUMBER() OVER (PARTITION BY [Date]
                                  ORDER BY [Order] ASC, [Percentage] DESC) AS rk
        FROM   mytable) t
WHERE  rk = 1
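
If you want to try it out, a minimal setup could look like this; the table name mytable and the column types are assumptions, since the question doesn't show the actual schema:

-- Assumed schema for the question's sample data (names and types are guesses)
CREATE TABLE mytable (
    [Date]       date NOT NULL,
    [Order]      int  NOT NULL,
    [Percentage] int  NOT NULL
);

INSERT INTO mytable ([Date], [Order], [Percentage])
VALUES ('2016-03-01', 1, 0),
       ('2016-03-01', 2, 20),
       ('2016-03-02', 1, 0),
       ('2016-03-02', 2, 20),
       ('2016-03-03', 2, 50),
       ('2016-03-03', 2, 20);

With this data, the query above returns one row per date: (2016-03-01, 1, 0), (2016-03-02, 1, 0) and (2016-03-03, 2, 50).
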
Mureinik

If you're using SQL Server 2012+, you could use the following solution:

SELECT DISTINCT
  [Date],
  FIRST_VALUE ([Order])   OVER (PARTITION BY [Date] ORDER BY [Order] ASC, [Percent] DESC),
  FIRST_VALUE ([Percent]) OVER (PARTITION BY [Date] ORDER BY [Order] ASC, [Percent] DESC)
FROM (
  VALUES('2016-03-01', 1, 0),
        ('2016-03-01', 2, 20),
        ('2016-03-02', 1, 0),
        ('2016-03-02', 2, 20),
        ('2016-03-03', 2, 50),
        ('2016-03-03', 2, 20)
) AS t([Date], [Order], [Percent])

How does it work? For each partition (i.e. each "group" of rows sharing a [Date]), we select the first value, ordered by [Order]. If two rows tie on the lowest [Order], the one with the higher [Percent] wins because of the descending order. That is pretty much the requirement from your question.

Because the first value is the same for the entire partition, we can use DISTINCT to remove duplicates, afterwards.
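
If your data sits in a real table rather than in the inline VALUES list, the same idea could look like this; here I reuse the mytable name and the [Percentage] column from the question as assumptions:

-- FIRST_VALUE() per date, ties on [Order] broken by the highest [Percentage]
SELECT DISTINCT
  [Date],
  FIRST_VALUE ([Order])      OVER (PARTITION BY [Date]
                                   ORDER BY [Order] ASC, [Percentage] DESC) AS [Order],
  FIRST_VALUE ([Percentage]) OVER (PARTITION BY [Date]
                                   ORDER BY [Order] ASC, [Percentage] DESC) AS [Percentage]
FROM mytable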

A note on performance:

Be cautious with this solution, especially on SQL Server. A ROW_NUMBER()-based solution as suggested here will outperform mine: slightly on Oracle, and drastically on SQL Server (see the comments below).
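
The timings discussed in the comments below were taken against an indexed sample table. An index of roughly this shape (again assuming the mytable columns above; the exact SQL Fiddle setup isn't reproduced here) lets either query read each date's rows in the required order:

-- Hypothetical index matching the PARTITION BY / ORDER BY of both queries
CREATE INDEX ix_mytable_date_order_pct
    ON mytable ([Date], [Order] ASC, [Percentage] DESC);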

Lukas Eder
  • While it is a clean solution, SQL Server is unfortunately not (yet) too smart about this type of query and will end up with a fairly ugly query plan (see also [this blog post](http://sqlblog.com/blogs/rob_farley/archive/2012/04/10/analytic-functions-they-re-not-aggregates.aspx)). That's why I would go with Mureinik's solution, which has only one simple window function (and it works with older SQL Server versions as well). – Lucero Mar 23 '16 at 19:13
  • As the article states, it depends on whether `DISTINCT` will remove many rows. – Lukas Eder Mar 23 '16 at 19:30
  • The point is that the `rk=1` check is a simple filter, no distinct computation necessary at all. ;) – Lucero Mar 23 '16 at 20:46
  • A filter across the entire data set is `O(N)`. `DISTINCT` can be implemented with a hashmap, i.e. `O(N)`, too. Your turn. – Lukas Eder Mar 23 '16 at 22:22
  • Sure. As noted before, the filter-based solution only uses one window function, which reduces the computation time. Then we don't need any storage of hashes or similar for the filter, and the operation is also simpler to perform for a single item. This eliminates memory/storage requirements and reduces computation time. Agreed? – Lucero Mar 23 '16 at 22:26
  • There is "storage" for the filter. The row number column needs to be stored (unless SQL Server can push the predicate down into the window function?). OK, I fixed my answer to use only a single `OVER()` clause. Now there's only one remaining window. So, possibly, the two solutions (depending on the implementation) can be implemented in roughly the same way. Anyway, the other answer is more reasonable, but I like to encourage people to think outside the box. – Lukas Eder Mar 23 '16 at 22:35
  • Be my guest; your version is about 60 times slower on the sample I created: http://sqlfiddle.com/#!6/707e5/1/0 vs. http://sqlfiddle.com/#!6/707e5/1/1 - Feel free to fiddle yourself! ;) – Lucero Mar 23 '16 at 22:57
  • OK, I must admit, I'm surprised by this order of magnitude. Thanks for the effort! – Lukas Eder Mar 23 '16 at 23:13
  • When you look at the query plan (which is also available in the fiddle BTW) you can clearly see what is happening. Also note that the index is used by both queries the same (and if you remove the index it will perform a table scan followed by a sort on both also the same), so while this makes the efficiency of the actual computation/filtering stand out more I think it is not unfair to have it in there. That said, I use the filtering technique in an application of mine (see https://stackoverflow.com/questions/3914093 ) so I am somewhat familiar with the performance characteristics. – Lucero Mar 23 '16 at 23:18
  • Again, I'm well aware of the conceptual differences, but I underestimated the negative effect on SQL Server. I ran your test on Oracle with "only" a 3x performance penalty on the `FIRST_VALUE()` solution. Unfortunately, I cannot show it on SQL Fiddle right now, as it appears broken for Oracle... – Lukas Eder Mar 24 '16 at 08:01
  • Interesting! I have not worked on Oracle databases for over 10 years so I'm not up to date on its performance characteristics, but even a 3x penalty is huge when considering that the queries use the same windowed function partition and order. It may also be that `FIRST_VALUE()` is more costly to evaluate than `ROW_NUMBER()` because it has to fetch data from other rows in the same partition, whereas `ROW_NUMBER()` does not need any data. – Lucero Mar 24 '16 at 10:20
  • Indeed it is still a huge penalty, and I'm still surprised, because given the index, `FIRST_VALUE()` could just scan the index for the first leaf node and produce the row there. Surprisingly, in the Oracle execution plan, I can see that the entire window is sorted for `FIRST_VALUE()`, but not for `ROW_NUMBER()` because the `rk = 1` predicate is pushed down into the window function. See http://i.imgur.com/WPR9rzo.png vs http://i.imgur.com/CvhrCzh.png. I'm sure this could be further optimised by the DB engine, as I don't see anything *conceptually* worse in the `FIRST_VALUE()` approach. – Lukas Eder Mar 24 '16 at 10:28
  • ... but this is a common thing in SQL. *Conceptually* (= in theory) there's no reason why some SQL should be slow :) – Lukas Eder Mar 24 '16 at 10:29
  • It's relieving but also disappointing to know that even today's SQL query optimizers are not as smart as the human sitting in front of the computer. Nevertheless, as I wrote in my previous comment, my uneducated guess is that `ROW_NUMBER()`'s independence from the other rows' data in the partition is key to the performance difference. Also, if window functions could be used as aggregate functions in a `GROUP BY`, things would maybe look different as well. – Lucero Mar 24 '16 at 10:36
  • "smart" is relative. There's a lot of intuition that makes humans "smart". But can you formally *prove* that a SQL transformation in an optimiser is valid? That's rather hard, and you want the optimiser to come up with a decision quickly... The more transformations you want to support, the more possible plans an optimiser will need to explore. The price of optimising something as rare as `FIRST_VALUE()` might be too high. – Lukas Eder Mar 24 '16 at 10:39