1

I have a process stages I need to find when did they happen. Unfortunately I can only do it though text extraction. I have extracted dates that match the stages found in the associated text column. Unfortunately some text values are similar and were inputed to the table incorrectly, and so extracted dates are duplicated. I cannot edit the existing table, however, I am trying to create a new one based on the manipulations I am doing (text extract, date ordering, etc). I need to make sure that the dates are not mixed up and are consecutive. For example stage 1 date is earlier than stage 2 etc...

Data sample

Person_ID Stage Stage_Num Stage_Date
1 Stage-1 1 2020-01-01
1 Stage-1 1 2021-02-02
1 Stage-2 2 2020-05-01
1 Stage-2 2 2020-05-06
1 Stage-1-R 3 2021-02-02
1 Stage-1-R 3 2020-01-01

Final result I am expecting is

Person_ID Stage Stage Num Stage_Date
1 Stage-1 1 2020-01-01
1 Stage-2 2 2020-05-01
1 Stage-1-R 3 2021-02-02

I've tried the following:

select a.person_id,a.stage, a.stage_num, min(a.stage_date)   
from
all_activity_dates a   join all_activity_dates b   
where
a.stage_num<=b.stage_num   
and a.stage_date<b.stage_date   
group by
a.person_id, a.stage, a.stage_num   
order by stage_num

However, the result date for stage 1 and 3 turn out to be the same. How do I perform it correctly?

UPD1 To whoever closed my question: I cannot simply select the min and group by STAGE (I definitely know how to do that) In my original question I state I need consecutive order, but with simple MIN DATE STAGE 1 and STAGE 3 would have the SAME DATE , BUT STAGE 3 HAS to be AFTER STAGE 2.

UPD2 Plus for stages I am missing to extract date I'd need to add a random date between previous and next stage

If it is a duplicate, I. am glad to check a working solution, I just could not find a similarly worded question

Kat Loka
  • 21
  • 3
  • For each stage, do you want the earliest date that is later than any of the dates on earlier stages? And are you looking for a query that can handle many stages (not just 3?) – Levin Magruder Dec 26 '21 at 22:13

1 Answers1

0

Here's a solution that will at least filter out the dates that would mess up the order of Stage_num.

select Person_ID, Stage, Stage_Num
, max(Stage_Date) as Stage_Date
from your_table t
where not exists (
        select 1 
        from your_table t2
        where t2.person_id = t.person_id
          and t2.stage_num = t.stage_num + 1
        group by t2.person_id, t2.stage_num
        having max(t2.stage_date) < t.stage_date
   )
group by Person_ID, Stage, Stage_Num
order by Person_ID, Stage_Num;
Person_ID Stage Stage_Num Stage_Date
1 Stage-1 1 2020-01-01
1 Stage-2 2 2020-05-06
1 Stage-1-R 3 2021-02-02

Demo on db<>fiddle here

But the thing about adding the missing Stage_nums with a random stage name & date is too unclear.
You could left join to a tally table. Or just add them manually.

LukStorms
  • 28,916
  • 5
  • 31
  • 45
  • Thank you! I will see what I can figure out where the dates are missing, but for dups this work perfectly – Kat Loka Dec 27 '21 at 15:48