Step by step:
First, get the counts of rows per (PID, CID)
. This is simple:
SELECT
PID,
CID,
COUNT(*) AS cnt
FROM checks
GROUP BY
PID,
CID
And you get this result set for your example:
PID CID cnt
--- --- ---
p1 c1 2
p1 c2 3
p2 c1 6
p2 c2 5
Now, throw in COUNT(*) OVER (PARTITION BY PID)
to return the number of categories per person:
SELECT
PID,
CID,
COUNT(*) AS cnt,
COUNT(*) OVER (PARTITION BY PID) AS cat_cnt
FROM checks
GROUP BY
PID,
CID
The OVER
clause turns a "normal" aggregate function COUNT()
into a window aggregate function. That makes the COUNT(*)
operate on the grouped row set rather than the source one. So, COUNT(*) OVER ...
in this case counts rows per PID
, which for us has the meaning of category counts per person. And this is the updated result set:
PID CID cnt cnt_cat
--- --- --- -------
p1 c1 2 2
p1 c2 3 2
p2 c1 6 2
p2 c2 5 2
One more thing left is to rank the cnt
values per PID
. This may be tricky as there may be ties at the top counts. If you always want a single row per PID
and are perfectly indifferent to which CID, cnt
will be in case of a tie, you can modify the query like this:
SELECT
PID,
CID,
COUNT(*) AS cnt,
COUNT(*) OVER (PARTITION BY PID) AS cat_cnt,
ROW_NUMBER() OVER (PARTITION BY PID ORDER BY COUNT(*) DESC) AS rn
FROM checks
GROUP BY
PID,
CID
And this is what the result set will look like:
PID CID cnt cnt_cat rn
--- --- --- ------- --
p1 c1 2 2 2
p1 c2 3 2 1
p2 c1 6 2 1
p2 c2 5 2 2
At this point, the results contain all the data necessary to produce the final output, you just need to filter on cnt_cat
and rn
. However, you cannot do that directly. Instead, use the last query as a derived table, be it a WITH
table expression or a "normal" subselect. Below is an example using WITH
:
WITH grouped AS (
SELECT
PID,
CID,
COUNT(*) AS cnt,
COUNT(*) OVER (PARTITION BY PID) AS cat_cnt,
ROW_NUMBER() OVER (PARTITION BY PID ORDER BY COUNT(*) DESC) AS rn
FROM checks
GROUP BY
PID,
CID
)
SELECT PID, CID, cnt
FROM grouped
WHERE cat_cnt > 1
AND rn = 1
;
Here's a SQL Fiddle demo (using Oracle): http://sqlfiddle.com/#!4/cd62d/8
To expand a bit more on the ranking part, if you still want to return a single CID, cnt
per PID
but would prefer to have more control on what row should be decided as the "winner", you'll need to add a tie-breaker to the ORDER BY
clause of the ranking function. As an example, you could modify the original expression,
ROW_NUMBER() OVER (PARTITION BY PID ORDER BY COUNT(*) DESC) AS rn
with this one:
ROW_NUMBER() OVER (PARTITION BY PID ORDER BY COUNT(*) DESC, CID) AS rn
I.e. the tie-breaker is CID
, and so of the two or more CID
s with the top count, the one that sorts before the others wins.
Still, you may want to decide to return all the top counts per PID
. In that case, use either RANK()
or DENSE_RANK()
instead of ROW_NUMBER()
(and no tie-breaker), i.e. like this:
RANK() OVER (PARTITION BY PID ORDER BY COUNT(*) DESC) AS rn