7

I am running Postgres 9.1.3 32-bit on Windows 7 x64. (Have to use 32 bit because there is no Windows PostGIS release compatible with 64 bit Postgres.) (EDIT: As of PostGIS 2.0, it is compatible with Postgres 64 bit on windows.)

I have a query that left joins a table (consistent.master) with a temporary table, then inserts the resulting data into a third table (consistent.masternew).

Since this is a left join, the resulting table should have the same number of rows as the left table in the query. However, if I run this:

SELECT count(*)
FROM consistent.master

I get 2085343. But if I run this:

SELECT count(*)
FROM consistent.masternew

I get 2085703.

How can masternew have more rows than master? Shouldn't masternew have the same number of rows as master, the left table in the query?

Below is the query. The master and masternew tables should be identically-structured.

--temporary table created here
--I am trying to locate where multiple tickets were written on
--a single traffic stop
WITH stops AS (
    SELECT citation_id,
           rank() OVER (ORDER BY offense_timestamp,
                     defendant_dl,
                     offense_street_number,
                     offense_street_name) AS stop
    FROM   consistent.master
    WHERE  citing_jurisdiction=1
)

--Here's the insert statement. Below you'll see it's
--pulling data from a select query
INSERT INTO consistent.masternew (arrest_id,
  citation_id,
  defendant_dl,
  defendant_dl_state,
  defendant_zip,
  defendant_race,
  defendant_sex,
  defendant_dob,
  vehicle_licenseplate,
  vehicle_licenseplate_state,
  vehicle_registration_expiration_date,
  vehicle_year,
  vehicle_make,
  vehicle_model,
  vehicle_color,
  offense_timestamp,
  offense_street_number,
  offense_street_name,
  offense_crossstreet_number,
  offense_crossstreet_name,
  offense_county,
  officer_id,
  offense_code,
  speed_alleged,
  speed_limit,
  work_zone,
  school_zone,
  offense_location,
  source,
  citing_jurisdiction,
  the_geom)

--Here's the select query that the insert statement is using.    
SELECT stops.stop,
  master.citation_id,
  defendant_dl,
  defendant_dl_state,
  defendant_zip,
  defendant_race,
  defendant_sex,
  defendant_dob,
  vehicle_licenseplate,
  vehicle_licenseplate_state,
  vehicle_registration_expiration_date,
  vehicle_year,
  vehicle_make,
  vehicle_model,
  vehicle_color,
  offense_timestamp,
  offense_street_number,
  offense_street_name,
  offense_crossstreet_number,
  offense_crossstreet_name,
  offense_county,
  officer_id,
  offense_code,
  speed_alleged,
  speed_limit,
  work_zone,
  school_zone,
  offense_location,
  source,
  citing_jurisdiction,
  the_geom
FROM consistent.master LEFT JOIN stops
ON stops.citation_id = master.citation_id

In case it matters, I have run a VACUUM FULL ANALYZE and reindexed both tables. (Not sure of exact commands; did it through pgAdmin III.)

Aren Cambre
  • 6,540
  • 9
  • 30
  • 36

2 Answers2

13

A left join does not necessarily have the same number of rows as the number of rows in the left table. Basically, it is like a normal join, except rows of the left table that would not appear in the normal join are also added. So, if you have more than one row in the right table that matches one row in the left table, you can have more rows in your results than the number of rows of the left table.

In order to do what you want to do, you should use a group by, and a count to detect multiples.

select citation_id
from stops join master on stops.citation_id = master.citation_id
group by citation_id
having count(*) > 1
Rémi
  • 3,705
  • 1
  • 28
  • 39
  • 1
    Thank you. That is exactly it. `LEFT JOIN` means the *minimum* number of rows in the join will be that of the left table. – Aren Cambre Mar 19 '12 at 02:22
7

Sometimes you know there are multiples, but don't care. You just want to take the first or top entry.
If so, you can use SELECT DISTINCT ON:

FROM consistent.master LEFT JOIN (SELECT DISTINCT ON (citation_id) * FROM stops) s
ON s.citation_id = master.citation_id

Where citation_id is the column that you want to take the first (any) row for each match.

You might want to ensure this is deterministic and use ORDER BY with some other orderable column:

SELECT DISTINCT ON (citation_id) * FROM stops ORDER BY citation_id, created_at
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 3
    sigh, google takes me to my own answer... i feel like there must be a better way! – Andy Hayden Oct 03 '19 at 03:07
  • Won't this always return the rows in ordered by citation_id? I'm guessing that's a primary key, auto increment... Doesn't really help with any desirable order by. – Kevin Parker Mar 02 '20 at 03:05
  • DISTINCT ON can take an array. so SELECT DISTINCT ON ( a , b ) .... ORDER BY a, b – demanic Mar 31 '23 at 07:00