2

I have a sql data (table) which consists of n number of columns, but 3 are of particular interest

claim_id patient_id admission_date
1 p_1 2001-01-01
2 p_2 2001-02-01
3 p_1 2001-01-01
4 p_3 2001-02-02
5 p_2 2001-03-01
6 p_2 2001-04-01
7 p_1 2001-04-01

What I want is remove first row (based on admission_date) of every patient_id. In case there are two rows for any patient on his/her first date of admission (e.g. patient_id = p_1 having 2 rows 1 & 3 with same dates of admission) then remove all those rows; and get an output like this

claim_id patient_id admission_date
5 p_2 2001-03-01
6 p_2 2001-04-01
7 p_1 2001-04-01

I am not very well conversant with sql, but I can do that easily in R using duplicated. Is something like that, a memory efficient (I have to filter 10 million rows at least) is available in POSTGRESQL?

What I have tried? Basically nothing in sql. If that data was of less size I could have exported it in R and have done my analysis there.

Please help.

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • Claim_id is a primary key? {patient_id,admission_date} is a unique? – wildplasser Jul 21 '21 at 09:26
  • Yes @wildplasser, claim_id is unique identifier key in this transaction table. `patient_id` + `admission_date` combination may not be unique, but if that is the condition, we can safely exclude all rows falling on first date of admission of that patient_id – AnilGoyal Jul 21 '21 at 09:28
  • If there happens to be only one record for a patient_id: do you want to suppress that, too? – wildplasser Jul 21 '21 at 09:32
  • yes like `patient_id` `p_3` has only 1 claim and I don't want that in output – AnilGoyal Jul 21 '21 at 09:34

2 Answers2

1

You can use window functions for that:

select claim_id, patient_id, admission_date
from (
  select claim_id, patient_id, admission_date, 
         dense_rank() over (partition by patient_id order by admission_date) as rn
   from the_table
) t 
where rn > 1;

dense_rank() creates sequential numbers for each patient_id sorted by the admission_date. If two rows have the same admission_date, they will get the same "rank". The WHERE clause in the outer query simply removes the "first" row for each patient.

1

We can use exists logic here:

SELECT claim_id, patient_id, admission_date
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
              WHERE t2.patient_id = t1.patient_id AND
                    t2.claim_id < t1.claim_id);

Given that some patients seems to have mutilple admissions on the same date, I am using the claim_id as an ordered identifier to determine what admission came first.

If performance of the above be a concern, then the following index should speed things up substantially:

CREATE INDEX idx ON yourTable (patient_id, claim_id);
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Thanks for the response, but I have seen that these join type of queries take days in my actual data to process – AnilGoyal Jul 21 '21 at 09:36
  • @AnilGoyal Then add an index per my suggestion. My query can be made performant if tuned properly, I believe. – Tim Biegeleisen Jul 21 '21 at 09:41
  • Creating indices will certainly enhance the performance. But not everytime index creation rights are given to analysts. In my case index creation rights are not given. – AnilGoyal Jul 21 '21 at 09:45
  • But still I am trying to do the filtering on both methods suggested. Thanks for your response. – AnilGoyal Jul 21 '21 at 09:46
  • @AnilGoyal : in that case, the data model should contain a PG on claim_id and an addtitional index (Non unique?) on {patient_id,admission_date} [see my first comment] – wildplasser Jul 21 '21 at 09:48
  • This solution was also great. Accepting other one as that served my purpose well. – AnilGoyal Jul 22 '21 at 15:57