In Postgres, how can I get the sub-grouping with the max value for a group?

Question

In Postgres, I have a historical table for a subway system that has this structure:

CREATE TABLE stop_history
(
    stop_id character varying,
    route_id character varying,
    next_stop_id character varying
);

I'm trying to figure out: For a stop and route, what is the most common next stop?

What I would need to do is: Group by stop, routes and next stop, and get the count of those groups. For each of those groups, get the group with the highest count for each stop_id and route_id combination.

How would I write a postgres query like this, and what indexes should I place on this table to maximize performance?

One of the challenges I'm running into is not being able to use count(*) or max(count(*)) in a where clause.

With sample data:

INSERT INTO stop_history VALUES ('101N', '1', NULL);
INSERT INTO stop_history VALUES ('102N', '1', '101N');
INSERT INTO stop_history VALUES ('103N', '1', '102N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '102N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '102N');
INSERT INTO stop_history VALUES ('101N', 'D', NULL);
INSERT INTO stop_history VALUES ('102N', 'D', '101N');
INSERT INTO stop_history VALUES ('102N', 'D', '101N');
INSERT INTO stop_history VALUES ('102N', 'D', NULL);

Expected output is:

Stop | Route | Most common Next Stop | Frequency
101N 1 NULL 1
102N 1 101N 1
103N 1 102N 1
104N 1 103N 3
101N D NULL 1
102N D 101N 2

Please add some sample data (formatted text, please) and the expected output based on that sample data. Ideally everything as `insert` statement including the necessary `create table` statement(s) — , Jan 16 '15 at 14:48
@a_horse_with_no_name added create and sample insert and expected output — Oved D, Jan 16 '15 at 15:23

score 2 · Answer 1 · answered Jan 16 '15 at 15:40

Something like this:

select distinct on (stop_id, route_id) stop_id, 
       route_id, 
       coalesce(next_stop_id, 'NULL'), 
       count(*) over (partition by route_id, stop_id, coalesce(next_stop_id, 'NULL')) as frequency
from stop_history
order by route_id, stop_id, frequency desc

The window function (count(*) over (...)) counts the frequency of the next_stop_id column.

The (Postgres) specific distinct on() is then used to reduce the result to only the ones with the highest frequency (this is achieved by the final order by ... frequence DESC)

SQLFiddle: http://sqlfiddle.com/#!15/66ff6/1

In Postgres, how can I get the sub-grouping with the max value for a group?

1 Answers1