Identifying first unique value in PostgreSQL

Question

I have a table crashes with around one million rows, each row containing data on either:

every car crash that did not occur near a school, or
every car crash that occurred near a school, with additional rows per crash if it occurred near more than one school (e.g. 4 rows for a crash near 4 schools). The highest number of rows/nearby schools for one crash is 10.

I'd like to add a column to the table returning a "1" for only one appearance of every crash_id that appears in more than one row, and a "0" for any subsequent occurrences of that same crash_id in the column crash_id. Which row has a 1 or 0 per crash_id doesn't matter.

I've tried all of the suggestions offered in response to this similar question, but I couldn't get any of them to work for me.

FWIW, I got this to work in Excel using this formula:

=(COUNTIF($C$2:$C2,$C2)=1)+0

But that was for a small table, not one with one million rows.

What I've tried so far:

SELECT * 
FROM 
( 
    SELECT * , ROW_NUMBER() OVER(PARTITION BY crash_id) AS row 
    FROM crashes 
) AS A1 
WHERE row <6

SELECT * 
FROM 
(
    SELECT * , ROW_NUMBER() OVER(PARTITION BY crash_id) AS row 
    FROM crashes
) AS A1 
WHERE row = 1

I understand this isn't optimal database design, but it allows me to get most of what I need, except for what I'm describing above.

A [mcve] is a great start when asking SQL questions. Also note that homework related questions are expected to prove some extra effort. — jarlh, Dec 15 '21 at 21:40
"...for the first appearance of every unique crash_id..." -- How do you define which one of the 10 rows is the first one? Remember that in relational databases rows don't have injerent ordering. — The Impaler, Dec 15 '21 at 21:41
First, this is a poor database design. There should be a `crash` table, and a `crash-at-school` table with `crash_id` as a foreign key. Repetitions of the `crash_id` value in the `crash` table is a code smell. What is the unique key of the `crash` table? Without one, how will you identify the rows to be updated? — Andrew Lazarus, Dec 15 '21 at 21:48
**jarlh** this isn't homework. If you're asking to see what I've tried so far: ' SELECT * FROM ( SELECT * , ROW_NUMBER() OVER(PARTITION BY crash_id) AS row from crashes ) AS A1 WHERE row <6 ' ' SELECT * FROM ( SELECT * , ROW_NUMBER() OVER(PARTITION BY crash_id) AS row FROM joined ) as A1 WHERE row = 1 ' — AenidMansarde, Dec 15 '21 at 22:00
**TheImpaler** for my purposes it doesn't really matter which one is identified as first. I just want a column that will allow me to filter the table so that I only see one row per crash (i.e. WHERE id_unique = '1'). — AenidMansarde, Dec 15 '21 at 22:03
**Andrew Lazarus** Fair enough (I'm new to this). I created this table by joining a prior table `crashes` with a table `schools`, so there's nothing wrong with the data. Having the data organized this way enables me to get nearly everything I need from it, except for the column I'm seeking in my original question. There's also a primary key column with a unique number for each row. — AenidMansarde, Dec 15 '21 at 22:08

LukStorms · Accepted Answer · 2021-12-16T20:52:46.427

Just a simple test to add a first_crash column to crashes.

But it needs something to determine which row was first. Since tables are in essence unsorted sets.
The example uses the ID for that.

create table crashes (
 id serial primary key, 
 crash_id int, 
 school_id int
);

alter table crashes 
  add constraint uniq_school_crash unique (crash_id, school_id);

insert into crashes (crash_id, school_id) values
  (101,11), (101,10), (101,12)
, (102,25), (102,24), (102,23)
, (103,null)

alter table crashes
 add column first_crash int default 0;

update crashes c
set first_crash = 1
where first_crash = 0
  and ( 
       school_id is null
    or not exists (
      select 1
      from crashes c2
      where c2.crash_id = c.crash_id
        and c2.id < c.id
    ));

select * from crashes order by id

id | crash_id | school_id | first_crash
-: | -------: | --------: | ----------:
 1 |      101 |        11 |           1
 2 |      101 |        10 |           0
 3 |      101 |        12 |           0
 4 |      102 |        25 |           1
 5 |      102 |        24 |           0
 6 |      102 |        23 |           0
 7 |      103 |      null |           1

db<>fiddle here

Extra

update by row_number

-- using row_number
update crashes c
set first_crash = q.rn
from ( 
      select id
      , row_number() over (partition by crash_id 
                           order by id asc) as rn
      from crashes
) q
where q.rn = 1 
  and q.id = c.id;

using a temp table

-- using temporary table
create temporary table tmp_crashes (
 id int primary key, 
 crash_id int
);

insert into tmp_crashes (id, crash_id)
select min(id), crash_id
from crashes
group by crash_id
order by min(id);

update crashes t
set first_crash = 1
from tmp_crashes tmp
where tmp.id = t.id;

I tried this on my dataset and it was taking hours to execute. Any idea why / how to speed it up? Could it be that 1m-row dataset is too large for a query like this? — AenidMansarde, Dec 16 '21 at 14:59
That's probably one reason why it's slow, no matter what sql trick you use. — LukStorms, Dec 16 '21 at 20:11
OK, I'll have to learn about how to add indexes then. Thanks for your help. — AenidMansarde, Dec 16 '21 at 20:26
@AenidMansarde Just for fun, added 2 extra methods to the fiddle. — LukStorms, Dec 16 '21 at 20:54
Wow, thanks! I'll have to spend some time understanding these. — AenidMansarde, Dec 16 '21 at 22:09

Identifying first unique value in PostgreSQL

1 Answers1