2

Dataset HAVE includes two variables with misspelled names in them: names and friends.

Name   Age   Friend
Jon     11   Ann
Jon     11   Tom
Jimb    12   Egg
Joe     11   Egg
Joe     11   Anne
Joe     11   Tom
Jed     10   Ann

I have a small dataset CORRECTIONS that includes wrong_names and resolved_names.

current_names   resolved_names
Jon             John
Ann             Anne
Jimb            Jim

I need any name in names or friends in HAVE that matches a name in the wrong_names column of CORRECTIONS to get recoded to the corresponding string in resolved_name. The resulting dataset WANT should look like this:

Name   Age   Friend
John    11   Anne
John    11   Tom
Jim     12   Egg
Joe     11   Egg
Joe     11   Anne
Joe     11   Tom
Jed     10   Anne

In R, I could simply invoke each dataframe and vector using if_else(), but the DATA step in SAS doesn't play nicely with multiple datasets. How can I make these replacements using CORRECTIONS as a look-up table?

J.Q
  • 971
  • 1
  • 14
  • 29

4 Answers4

3

There are many ways to do a lookup in SAS.

First of all, however, I would suggest to de-duplicate your look-up table (for example, using PROC SORT and Data Step/Set/By) - deciding which duplicate to keep (if any exist).

As for the lookup task itself, for simplicity and learning I would suggest the following:

The "OLD SCHOOL" way - good for auditing inputs and outputs (it is easier to validate the results of a join when input tables are in the required order):

*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon     11   Ann
Jon     11   Tom
Jimb    12   Egg
Joe     11   Egg
Joe     11   Anne
Joe     11   Tom
Jed     10   Ann
run;

*** lookup table;
data corrections;
length current_names $10.  resolved_names $10.;
input current_names   resolved_names;
datalines;
Jon             John
Ann             Anne
Jimb            Jim
run;

*** de-duplicate lookup table;
proc sort data=corrections nodupkey; by current_names; run;

proc sort data=have; by name; run;   

data have_corrected;
    merge have(in=a) 
          corrections(in=b rename=(current_names=name))
          ;
    by name;
    if a;
    if b then do;
        name=resolved_names;
    end;
run;

The SQL way - which avoids sorting the have table:

proc sql;
    create table have_corrected_sql as
    select 
        coalesce(b.resolved_names, a.name) as name, 
        a.age, 
        a.friend
    from work.have as a left join work.corrections as b
    on a.name eq b.current_names
    order by name;
quit;

NB the Coalesce() is used to replace missing resolved_names values (ie when there is no correction) with names from the have table

EDIT: To reflect Quentin's (CORRECT) comment that I'd missed the update to both name and friend fields.

Based on correcting the 2 fields, again many approaches but the essence is one of updating a value only IF it exists in the lookup (corrections) table. The hash object is pretty good at this, once you've understood it's declaration.

NB: any key fields in the Hash object need to be specified on a Length statement BEFOREHAND.

EDIT: as per ChrisJ's alternative to the Length statement declaration, and my reply (see below) - it would be better to state that key variables need to be defined BEFORE you declare the hash table.

data have_corrected;
keep name age friend;
length current_names $10.;

    *** load valid names into hash lookup table;
    if _n_=1 then do;
        declare hash h(dataset: 'work.corrections');
        rc = h.defineKey('current_names');
        rc = h.defineData('resolved_names');
        rc = h.defineDone();
    end;
    do until(eof);
        set have(in=a) end=eof;
        *** validate both name fields;  
        if h.find(key:name) eq 0 then
            name = resolved_names;
        if h.find(key:friend) eq 0 then
            friend = resolved_names;
        output;
    end;
run;

EDIT: to answer the comments re ChrisJ's SQL/Update alternative

Basically, you need to restrict each UPDATE statement to ONLY those rows that have name values or friend values in the corrections table - this is done by adding another where clause AFTER you've specified the set var = (clause). See below.

NB. AFAIK, an SQL solution to your requirement will require MORE than 1 pass of both the base table & the lookup table.

The lookup/hash table, however, requires a single pass of the base table, a load of the lookup table and then the lookup actions themselves. You can see the performance difference in the log...

proc sql;
*** create copy of have table;
create table work.have_sql as select * from work.have;
*** correct name field;
update work.have_sql as u
    set name = (select resolved_names 
                from work.corrections as n
                where u.name=n.current_names)
    where u.name in (select current_names from work.corrections)
        ;
*** correct friend field;
update work.have_sql as u
    set friend = (select resolved_names 
                  from work.corrections as n
                  where u.friend=n.current_names)
    where u.friend in (select current_names from work.corrections)
        ;
quit;
Peter
  • 71
  • 1
  • 6
  • That's not included in the question – Peter May 27 '18 at 11:31
  • 1
    The question states: "I need any name in names or friends in HAVE that matches a name in the wrong_names column of CORRECTIONS to get recoded to the corresponding string in resolved_name." And the WANT data set shown has corrections to both `NAME` and `FRIEND`. – Quentin May 27 '18 at 11:41
  • 1
    @Quentin: apologies, you're right - missed that. Updated answer – Peter May 27 '18 at 12:45
  • 1
    To avoid the need to manually define lengths of variables in the hash table beforehand, simply add `if 0 then set corrections;` before the `if _n_ = 1 ...` statement. – Chris J May 27 '18 at 13:45
  • @ChrisJ: absolutely. either way, the main point is that SAS needs to know (at compilation time) the length and type of the variables used in the hash object key – Peter May 27 '18 at 14:16
  • Thanks for outlining multiple approaches! – J.Q May 27 '18 at 19:36
  • Nice hash solution. Good reminder of the optional KEY argument for the FIND method. – Quentin May 27 '18 at 22:46
2

You can use an UPDATE in proc sql :

proc sql ;
  update have a
    set name   = (select resolved_names b from corrections where a.name   = b.current_names)
    where name in(select current_names from corrections) 
  ;
  update have a
    set friend = (select resolved_names b from corrections where a.friend = b.current_names)
    where friend in(select current_names from corrections)
  ;
quit ;

Or, you could use a format :

/* Create format */
data current_fmt ;
  retain fmtname 'NAMEFIX' type 'C' ;
  set resolved_names ;
  start = current_names ;
  label = resolved_names ;
run ;
proc format cntlin=current_fmt ; run ;

/* Apply format */
data want ;
  set have ;
  name   = put(name  ,$NAMEFIX.) ;
  friend = put(friend,$NAMEFIX.) ;
run ;
Chris J
  • 7,549
  • 2
  • 25
  • 25
  • Cool - I've never used update. But when I try your PROC SQL solution, I get an error on your alias for corrections (b, I think). When I revise the code to make calls to each dataset explicit (e.g., have.name instead of a.name, and corrections.resolved_names), I get a very weird result - mostly empty cells in both "name" and "friend" in the new version of HAVE. Any thoughts? – J.Q May 27 '18 at 21:52
  • 1
    Edited to split into two updates, and apply a selection criteria to the update. – Chris J May 28 '18 at 14:41
2

Given data

*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon     11   Ann
Jon     11   Tom
Jimb    12   Egg
Joe     11   Egg
Joe     11   Anne
Joe     11   Tom
Jed     10   Ann
run;

*** lookup table;
data corrections;
length from_name $10.  to_name $10.;
input  from_name       to_name;
datalines;
Jon             John
Ann             Anne
Jimb            Jim
run;

One SQL alternative is to perform a existent mapping select look up on each field to be mapped. This would be counter to joining the corrections table one time for each field to be mapped.

proc sql;
  create table want1 as
  select 
      case when exists (select *       from corrections where from_name=name)
           then        (select to_name from corrections where from_name=name)
           else name
      end as name
    , age
    , case when exists (select *       from corrections where from_name=friend)
           then        (select to_name from corrections where from_name=friend)
           else friend
      end as friend
  from
    have
  ;

Another, SAS only way, to perform inline left joins is to use a custom format.

data cntlin;
  set corrections;
  retain fmtname '$cohen'; /* the fixer */
  rename from_name=start to_name=label;
run;
proc format cntlin=cntlin;
run;

data want2;
  set have;
  name = put(name,$cohen.);
  friend = put(friend,$cohen.);
run;
Richard
  • 25,390
  • 3
  • 25
  • 38
1

Try this:

proc sql;
create table want as
    select p.name,p.age,
        case 
            when q.current_names is null then p.friend 
            else q.resolved_names 
        end 
    as friend1
        from
            (
        select 
            case 
                when b.current_names is null then a.name 
                else b.resolved_names 
            end 
        as name,
            a.age,a.friend
        from
            have a
        left join
            corrections b
            on upcase(a.name) = upcase(b.current_names)
            ) p
        left join
            corrections q
            on upcase(p.friend) = upcase(q.current_names);
quit;  

Output:

name age friend
John 11  Anne
Jed  10  Anne
Joe  11  Anne
Jim  12  Egg
Joe  11  Egg
Joe  11  Tom
John 11  Tom

Let me know in case of any clarifications.

G.Arima
  • 1,171
  • 1
  • 6
  • 13
  • This does not produce the WANT dataset - if you use datalines you'll see that John is friends with John (himself?!) in the first line, but he is supposed to be friends with Anne. And quick Q, why does your case when search for "null" conditions in b.current_names? – J.Q May 27 '18 at 21:47
  • 1
    I have made the necessary changes and also attached the desired output. To answer your quick Q, I searched for "null" conditions for current name because i have joined on name and current name therefore, is the current name is null that means no changes are to be made, if the current is not null that means the name which is to be corrected is present in corrections table. – G.Arima May 28 '18 at 06:56