How to remove first row in every grouped result ordered on date column in postgresql

Question

I have a sql data (table) which consists of n number of columns, but 3 are of particular interest

claim_id	patient_id	admission_date
1	p_1	2001-01-01
2	p_2	2001-02-01
3	p_1	2001-01-01
4	p_3	2001-02-02
5	p_2	2001-03-01
6	p_2	2001-04-01
7	p_1	2001-04-01

What I want is remove first row (based on admission_date) of every patient_id. In case there are two rows for any patient on his/her first date of admission (e.g. patient_id = p_1 having 2 rows 1 & 3 with same dates of admission) then remove all those rows; and get an output like this

claim_id	patient_id	admission_date
5	p_2	2001-03-01
6	p_2	2001-04-01
7	p_1	2001-04-01

I am not very well conversant with sql, but I can do that easily in R using duplicated. Is something like that, a memory efficient (I have to filter 10 million rows at least) is available in POSTGRESQL?

What I have tried? Basically nothing in sql. If that data was of less size I could have exported it in R and have done my analysis there.

Please help.

Claim_id is a primary key? {patient_id,admission_date} is a unique? — wildplasser, Jul 21 '21 at 09:26
Yes @wildplasser, claim_id is unique identifier key in this transaction table. `patient_id` + `admission_date` combination may not be unique, but if that is the condition, we can safely exclude all rows falling on first date of admission of that patient_id — AnilGoyal, Jul 21 '21 at 09:28
If there happens to be only one record for a patient_id: do you want to suppress that, too? — wildplasser, Jul 21 '21 at 09:32
yes like `patient_id` `p_3` has only 1 claim and I don't want that in output — AnilGoyal, Jul 21 '21 at 09:34

score 1 · Accepted Answer · 2021-07-21T09:37:19.913

You can use window functions for that:

select claim_id, patient_id, admission_date
from (
  select claim_id, patient_id, admission_date, 
         dense_rank() over (partition by patient_id order by admission_date) as rn
   from the_table
) t 
where rn > 1;

dense_rank() creates sequential numbers for each patient_id sorted by the admission_date. If two rows have the same admission_date, they will get the same "rank". The WHERE clause in the outer query simply removes the "first" row for each patient.

Tim Biegeleisen · Answer 2 · 2021-07-21T09:40:22.887

1

We can use exists logic here:

SELECT claim_id, patient_id, admission_date
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
              WHERE t2.patient_id = t1.patient_id AND
                    t2.claim_id < t1.claim_id);

Given that some patients seems to have mutilple admissions on the same date, I am using the claim_id as an ordered identifier to determine what admission came first.

If performance of the above be a concern, then the following index should speed things up substantially:

CREATE INDEX idx ON yourTable (patient_id, claim_id);

edited Jul 21 '21 at 09:40

answered Jul 21 '21 at 09:34

Tim Biegeleisen

502,043
27
286
360

Thanks for the response, but I have seen that these join type of queries take days in my actual data to process – AnilGoyal Jul 21 '21 at 09:36
@AnilGoyal Then add an index per my suggestion. My query can be made performant if tuned properly, I believe. – Tim Biegeleisen Jul 21 '21 at 09:41
Creating indices will certainly enhance the performance. But not everytime index creation rights are given to analysts. In my case index creation rights are not given. – AnilGoyal Jul 21 '21 at 09:45
But still I am trying to do the filtering on both methods suggested. Thanks for your response. – AnilGoyal Jul 21 '21 at 09:46
@AnilGoyal : in that case, the data model should contain a PG on claim_id and an addtitional index (Non unique?) on {patient_id,admission_date} [see my first comment] – wildplasser Jul 21 '21 at 09:48
This solution was also great. Accepting other one as that served my purpose well. – AnilGoyal Jul 22 '21 at 15:57

How to remove first row in every grouped result ordered on date column in postgresql

2 Answers2