Select the foo
row with the highest count of bar
. in the case of matching rows with the same count, select the row with the highest id
.
I have solved the problem, but I wish to understand why the problem existed, and how the faulty query was able to produce different results. I will include my solution bellow as well.
I have boiled the problem down to a minimal data set:
foo bar id
123 1 1
123 1 2
123 2 3
456 1 4
456 2 5
Consider the following query:
with barCounter AS
(
SELECT
foo,
bar,
COUNT(bar) AS barCount
FROM @input it
GROUP By foo, bar
)
SELECT
MAX(it.id) AS id,
foobar.foo
FROM @input it
INNER JOIN
(
SELECT DISTINCT
foo,
FIRST_VALUE(bar) OVER (PARTITION BY foo ORDER BY barCount desc) AS bar,
max(barCount) OVER (PARTITION BY foo) AS barCount
FROM barCounter
) foobar
ON it.foo = foobar.foo AND it.bar = foobar.bar
Group BY foobar.foo, foobar.bar
Ran normally, the above query will return the following:
id foo
2 123
5 456
but if I remove the 123 records and only do the query on:
foo bar id
456 1 4
456 2 5
I get:
id foo
4 456
To explain this behavior. For some reason it returns different results when there are no rows separating the 456 rows from the inner join select distinct clause. For example this is what I'm referring to:
foo bar barCount vs. foo bar barCount
123 1 2 456 1 1
456 1 1 456 2 1
123 2 1
456 2 1
Now I have already fixed this bug in the query by changing the following:
I removed the "first_value" aggregator and I am only taking the highest barCount for that inner select clause. I also changed the main select clause max aggregator to a partition rather than a group by, and adding a distinct keyword accordingly:
-- with barCounter as () same as above
SELECT DISTINCT
MAX(it.id) OVER (PARTITION BY it.foo) AS id,
it.foo
FROM @input it
INNER JOIN
(
SELECT
foo,
bar,
max(barCount) over (PARTITION BY foo) AS barCount
FROM barCounter
) foobar
ON it.foo = foobar.foo AND it.bar = foobar.bar
So anyway, why does the first buggy query return a different result for each input data sets above? I don't understand really WHY my solution fixes the problem, all I know is the (2nd query) inner join query actually groups the rows together (like a partition). the first query has the rows in random order. How this affects the original query, I'm not sure.