1

I want to create a column "Cured" that takes value 1 when the customer left have1 either on the day or in the two following days after appearing in have2. Data have1 is a dataset that defines when a client entered and left certain status. have2 shows when the client was contacted.

This is the code I made:

proc SQL;
create table want2 as
select b.*
,     case when b.Outcome = "Answered" or
                b.Outcome = "Answerphone Message Left" or       
                b.Outcome = "Answerphone Message Not Left" or
                b.Outcome = "No Answer"
                and
                a.Start_date <= b.Date <= a.End_date
                and
                a.End_date <= b.Date+2
                then 1 else 0 END as Cured    
                
from have1  a, have2 b
where a.ID=b.ID;
quit;

The datasets I'm working with are:

data have1;
infile datalines dlmstr=' | ';
input ID  Start_date :ddmmyy10. End_date :ddmmyy10.;
format date start_date date9.;
datalines;
ID | Start_date | End_date 
1  | 01/01/2021 | 03/01/2021
1  | 20/01/2021 | 21/01/2021
2  | 05/01/2021 | 07/01/2021
3  | 10/01/2021 | 30/01/2021
3  | 25/01/2021 | 25/01/2021
;;;
run;

data have2;
infile datalines dlmstr=' | ';
input ID  Date :ddmmyy10. Outcome ;
format Date date9.;
datalines;
ID | Date       | Outcome
1  | 01/01/2021 | Answered
2  | 05/01/2021 | Asnwerphone Message Left
3  | 12/01/2021 | Answerphone Message Left
3  | 25/01/2021 | No Answer
;;;
run;

In this case I should get the following:

ID | Date       | Outcome                   | Cured 
1  | 01/01/2021 | Answered                  | 1  
2  | 05/01/2021 | Asnwerphone Message Left  | 1
3  | 12/01/2021 | Answerphone Message Left  | 0
3  | 25/01/2021 | No Answer                 | 1

ID1 cured because he left have1 2 days after 01/01

ID2 also cured because they left have1 2 days after 05/01

ID3 didn't cure because they left have1 more than 15 days after 12/01

ID3 cured because they left have1 on the same day

This is the result with that code:

ID| Date      | Outcome  | Cured
1   01JAN2021   Answered    1
1   01JAN2021   Answered    0
2   05JAN2021   Asnwerph    0
3   12JAN2021   Answerph    0
3   25JAN2021   No Answe    0
3   12JAN2021   Answerph    0
3   25JAN2021   No Answe    1

Which is not right: first of all it duplicates observations, but it doesnt accurately selects those that should be 1 in Cured (ID 2).

Any idea what might be wrong/missing?

amestrian
  • 546
  • 3
  • 12
  • use explicit `join` instead of implicit join. especially in your current query there is no `where` and the join logic spread all over the place, it is so hard to tell what the actual join logic and what the execution plan really are. – T. Peter Feb 23 '21 at 00:41
  • BTW: in SQL, character constants are delimited by single quotes. If SAS decides different: avoid it. – wildplasser Feb 23 '21 at 00:51
  • Please explain in words what you are trying to do. What do HAVE1 and HAVE2 represent? What does CURED=1 mean? – Tom Feb 23 '21 at 04:29
  • Why should the last line have CURED=1? For ID=3 the date 2021-01-25 is not less than 2021-01-25. Did you want to change the test to be `a.Start_date <= b.Date <= a.End_date`? Also why doesn't the last line repeat? Once for each interval from HAVE1 for ID=3 that contain that date? – Tom Feb 23 '21 at 04:43
  • @Tom explained a bit more about my issue and you're right, I was missing that <= on the second condition. I don't really understand your question about the last line – amestrian Feb 23 '21 at 09:51
  • @T.Peter hi, I added a WHERE and changed a few things, I also explained a bit further what's my objective and what error i'm getting specifically. – amestrian Feb 23 '21 at 11:06

4 Answers4

1

I suspect that you want:

select b.*,
       (case when b.Outcome in ('Answered', 'Answerphone Message Left', 
 'Answerphone Message Not Left', 'No Answer') and
                  a.Start_date <= b.Date and
                  b.Date < a.End_date and
                  a.End_date <= b.Date+2
             then 1 else 0 
        end) as Cured 

I also wonder if you really want a Cartesian product. Usually, you would want join with a specific join condition.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • thank you, actually the second condition was `a.Start_date <= b.Date <= a.End_date` so I think it's okay to put it all together, right? I also included a `where` statement in my code, which seems to be working okay except that for some reason it duplicates observations. – amestrian Feb 23 '21 at 11:01
  • @amestrian . . . It might be okay in SAS-native SQL. However, it is not standard SQL and I don't think any databases support that syntax (well, more specifically, some support the syntax but it does not do what you expect). If you are going to use `proc sql`, you might as well use more SQLish code. – Gordon Linoff Feb 23 '21 at 13:41
  • Makes sense, I changed it now. Do you know what might be happening with the duplicate observations? I edited my original post to describe it – amestrian Feb 23 '21 at 13:49
0

In any programming code (beyond SAS and SQL), whenever you use OR with AND, always separate conditions with parentheses:

CASE 
   WHEN (b.Outcome = "Answered" OR 
         b.Outcome = "Answerphone Message Left" OR 
         b.Outcome = "Answerphone Message Not Left" OR 
         b.Outcome = "No Answer")
        AND
        (a.Start_date <= b.Date < a.End_date)
        AND 
        (a.End_date <= b.Date+2)
   THEN 1 
   ELSE 0
 END AS Cured 

Better yet, use IN operator for collection of values and avoid need of OR:

CASE 
   WHEN b.Outcome IN ("Answered",
                      "Answerphone Message Left",
                      "Answerphone Message Not Left",
                      "No Answer")
        AND
        a.Start_date <= b.Date < a.End_date
        AND 
        a.End_date <= b.Date+2
   THEN 1 
   ELSE 0
 END AS Cured 
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • The `IN` operator and parenthesis was good! I'm still having issues because for some reason the answers are being duplicated. Even after adding a `where` statement at the end of my code. – amestrian Feb 23 '21 at 10:51
0

Sounds like you want to GROUP the results and set CURED=1 if ANY of the end dates for that ID meet your criteria. So use the MAX() aggregate function.

I am not sure why you are testing the values of OUTCOME since there are no values in your example that are not included in your list of values to test for.

data have1;
  input ID  Start_date :yymmdd. End_date :yymmdd.;
  format Start_date End_date yymmdd10.;
datalines4;
1 2021-01-01 2021-01-03
1 2021-01-20 2021-01-21
2 2021-01-05 2021-01-07
3 2021-01-10 2021-01-30
3 2021-01-25 2021-01-25
;;;;

data have2;
  input ID  Date :yymmdd. Outcome $40.;
  format Date yymmdd10.;
datalines4;
1 2021-01-01 Answered
2 2021-01-05 Answerphone Message Left
3 2021-01-12 Answerphone Message Left
3 2021-01-25 No Answer
4 2021-01-25 No Answer
;;;;

proc SQL;
create table want2 as
select b.id
     , b.date
     , b.outcome
     , max(
         case when (a.Start_date <= b.Date <= a.End_date)
               and (a.End_date <= b.Date+2) then 1
         else 0 
       end
       ) as Cured    
from have1 a
 right join have2 b
 on a.id=b.id
group by b.id,b.date,b.outcome
;
quit;

Result:

Obs    ID          Date    Outcome                     Cured

 1      1    2021-01-01    Answered                      1
 2      2    2021-01-05    Answerphone Message Left      1
 3      3    2021-01-12    Answerphone Message Left      0
 4      3    2021-01-25    No Answer                     1
 5      4    2021-01-25    No Answer                     0

PS Displaying dates in D-M-Y or M-D-Y order will just confuse half of your audience.

Tom
  • 47,574
  • 2
  • 16
  • 29
  • Thank you. I do have to specify the outcome because I have more that I didn't include in the snippet. Don't we need to specify a.Start_date <= b.Date <= a.End_date, because if not how else will the program know which period to look for? Anyways I tried running this code and it still gives duplicates, I think I know why that is but idk how to solve it – amestrian Feb 23 '21 at 14:08
  • I'm going to make a new post because the original question was already answered – amestrian Feb 23 '21 at 14:10
  • Are you tying to take something else from HAVE1 so that you need know which observation to match on? If you match just on DATE between START and END then ID=3 should have two output records since it matches two observations in HAVE1. – Tom Feb 23 '21 at 14:20
  • Okay I see what you mean, sorry that was a bad example, I didn't realize the dates were superposed. Ideally that wouldn't happen. – amestrian Feb 23 '21 at 14:23
0

Complex logic generally needs parentheses to properly group expressions.

The SQL logic operator AND has precedence over OR

A linearly coded logic statement such as

P OR Q or R OR S AND X AND Y AND Z

evaluates as

P or Q or R or (((S and X) and Y) and Z)

and generally needs to be corrected to

(P OR Q or R OR S) AND X AND Y AND Z

As stated by @Parfait, you can use IN operator when P Q R S are all expressions dealing with matching a single character variable with a number of different terms. Use IN: to match term prefixes.

Richard
  • 25,390
  • 3
  • 25
  • 38