Partition by query behaves differently with slightly different input data

Question

Select the foo row with the highest count of bar. in the case of matching rows with the same count, select the row with the highest id.

I have solved the problem, but I wish to understand why the problem existed, and how the faulty query was able to produce different results. I will include my solution bellow as well.

I have boiled the problem down to a minimal data set:

Consider the following query:

with barCounter AS
(
    SELECT 
        foo, 
        bar, 
        COUNT(bar) AS barCount
    FROM @input it
    GROUP By foo, bar
)
SELECT 
    MAX(it.id) AS id, 
    foobar.foo
FROM @input it
INNER JOIN
    (
        SELECT DISTINCT 
            foo, 
            FIRST_VALUE(bar) OVER (PARTITION BY foo ORDER BY barCount desc) AS bar, 
            max(barCount) OVER (PARTITION BY foo) AS barCount
        FROM barCounter
    ) foobar
    ON it.foo = foobar.foo AND it.bar = foobar.bar
Group BY foobar.foo, foobar.bar

Ran normally, the above query will return the following:

id  foo
2   123
5   456

but if I remove the 123 records and only do the query on:

foo bar id
456 1   4
456 2   5

I get:

id foo
4  456

To explain this behavior. For some reason it returns different results when there are no rows separating the 456 rows from the inner join select distinct clause. For example this is what I'm referring to:

foo bar barCount      vs.     foo bar barCount
123 1   2                     456 1   1
456 1   1                     456 2   1
123 2   1
456 2   1

Now I have already fixed this bug in the query by changing the following:

I removed the "first_value" aggregator and I am only taking the highest barCount for that inner select clause. I also changed the main select clause max aggregator to a partition rather than a group by, and adding a distinct keyword accordingly:

-- with barCounter as () same as above

SELECT DISTINCT
    MAX(it.id) OVER (PARTITION BY it.foo) AS id, 
    it.foo
FROM @input it
INNER JOIN
    (
        SELECT
            foo, 
            bar, 
            max(barCount) over (PARTITION BY foo) AS barCount
        FROM barCounter
    ) foobar
ON it.foo = foobar.foo AND it.bar = foobar.bar

So anyway, why does the first buggy query return a different result for each input data sets above? I don't understand really WHY my solution fixes the problem, all I know is the (2nd query) inner join query actually groups the rows together (like a partition). the first query has the rows in random order. How this affects the original query, I'm not sure.

score 1 · Accepted Answer · edited Sep 23 '22 at 23:04

A result set is like any other table unsorted, so when you don't give a proper order, the result of a query can be different.

If you has also ordered the window function after bar, like you would do in real life, you get the wanted result, ordering without bar will lead to unforseen result, that basically are deterministic, because a programm will choose always the same way, as long it wasn't reprogrammed.

in short, sort proper.

    SELECT 
        foo, 
        bar, 
        COUNT(bar) AS barCount
    FROM input it
    GROUP By foo, bar

| foo | bar | barCount |
| ---:|----:|---------:|
| 123 | 1   | 2        |
| 456 | 1   | 1        |
| 123 | 2   | 1        |
| 456 | 2   | 1        |

with barCounter AS
(
    SELECT 
        foo, 
        bar, 
        COUNT(bar) AS barCount
    FROM input it
    GROUP By foo, bar
)
SELECT DISTINCT 
            foo, 
            FIRST_VALUE(bar) OVER (PARTITION BY foo ORDER BY barCount desc,bar DESC) AS bar, 
            max(barCount) OVER (PARTITION BY foo) AS barCount
        FROM barCounter

| foo | bar | barCount |
| ---:|----:|---------:|
| 123 | 1   | 2        |
| 456 | 2   | 1        |

with barCounter AS
(
    SELECT 
        foo, 
        bar, 
        COUNT(bar) AS barCount
    FROM input it
    GROUP By foo, bar
)
SELECT DISTINCT 
            foo, 
            FIRST_VALUE(bar) OVER (PARTITION BY foo ORDER BY barCount desc,bar ASC) AS bar, 
            max(barCount) OVER (PARTITION BY foo) AS barCount
        FROM barCounter

| foo | bar | barCount |
| ---:|----:|---------:|
| 123 | 1 --| 2 -------|
| 456 | 1 --| 1 -------|

fiddle

Partition by query behaves differently with slightly different input data

1 Answers1