In Postgres, I have a historical table for a subway system that has this structure:
CREATE TABLE stop_history
(
stop_id character varying,
route_id character varying,
next_stop_id character varying
);
I'm trying to figure out: For a stop and route, what is the most common next stop?
What I would need to do is: Group by stop, routes and next stop, and get the count of those groups. For each of those groups, get the group with the highest count for each stop_id and route_id combination.
How would I write a postgres query like this, and what indexes should I place on this table to maximize performance?
One of the challenges I'm running into is not being able to use count(*)
or max(count(*))
in a where clause.
With sample data:
INSERT INTO stop_history VALUES ('101N', '1', NULL);
INSERT INTO stop_history VALUES ('102N', '1', '101N');
INSERT INTO stop_history VALUES ('103N', '1', '102N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '102N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '102N');
INSERT INTO stop_history VALUES ('101N', 'D', NULL);
INSERT INTO stop_history VALUES ('102N', 'D', '101N');
INSERT INTO stop_history VALUES ('102N', 'D', '101N');
INSERT INTO stop_history VALUES ('102N', 'D', NULL);
Expected output is:
Stop | Route | Most common Next Stop | Frequency
101N 1 NULL 1
102N 1 101N 1
103N 1 102N 1
104N 1 103N 3
101N D NULL 1
102N D 101N 2