2

In Postgres, I have a historical table for a subway system that has this structure:

CREATE TABLE stop_history
(
    stop_id character varying,
    route_id character varying,
    next_stop_id character varying
);

I'm trying to figure out: For a stop and route, what is the most common next stop?

What I would need to do is: Group by stop, routes and next stop, and get the count of those groups. For each of those groups, get the group with the highest count for each stop_id and route_id combination.

How would I write a postgres query like this, and what indexes should I place on this table to maximize performance?

One of the challenges I'm running into is not being able to use count(*) or max(count(*)) in a where clause.

With sample data:

INSERT INTO stop_history VALUES ('101N', '1', NULL);
INSERT INTO stop_history VALUES ('102N', '1', '101N');
INSERT INTO stop_history VALUES ('103N', '1', '102N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '102N');
INSERT INTO stop_history VALUES ('104N', '1', '103N');
INSERT INTO stop_history VALUES ('104N', '1', '102N');
INSERT INTO stop_history VALUES ('101N', 'D', NULL);
INSERT INTO stop_history VALUES ('102N', 'D', '101N');
INSERT INTO stop_history VALUES ('102N', 'D', '101N');
INSERT INTO stop_history VALUES ('102N', 'D', NULL);

Expected output is:

Stop | Route | Most common Next Stop | Frequency
101N 1 NULL 1
102N 1 101N 1
103N 1 102N 1
104N 1 103N 3
101N D NULL 1
102N D 101N 2
IMSoP
  • 89,526
  • 13
  • 117
  • 169
Oved D
  • 7,132
  • 10
  • 47
  • 69

1 Answers1

2

Something like this:

select distinct on (stop_id, route_id) stop_id, 
       route_id, 
       coalesce(next_stop_id, 'NULL'), 
       count(*) over (partition by route_id, stop_id, coalesce(next_stop_id, 'NULL')) as frequency
from stop_history
order by route_id, stop_id, frequency desc

The window function (count(*) over (...)) counts the frequency of the next_stop_id column.

The (Postgres) specific distinct on() is then used to reduce the result to only the ones with the highest frequency (this is achieved by the final order by ... frequence DESC)

SQLFiddle: http://sqlfiddle.com/#!15/66ff6/1