2

I am trying to join two datasets based on variables office_id and office_id_flag, each with 50,000 observations.

data1 variables

  1. grp
  2. age
  3. hispanic
  4. id
  5. race
  6. sex
  7. mental_ill
  8. mental_ill_dx
  9. office_id
  10. office_id_flag

data2 variables

  1. er_vis
  2. adm_hr
  3. psych_hosp
  4. psych_vis
  5. region
  6. region_rpc
  7. charges
  8. office_id
  9. office_id_flag

I first tried this in data step merge procedure because I know how that works, and the resulting dataset had 50,000 observations and 17 variables, which makes sense to me. Merge code:

proc sort data=data1; by office_id; run;
proc sort data=data2; by office_id; run;

data work.merge_datastep;
    merge data1 (in=dem) data2 (in=hosp);
    by office_id;
    if dem and hosp;
run;
*53000 observations and 17 variables;

I've tried so many SQL merge types and each one produces a dataset with 56000 observations and 17 variables, which doesn't make sense.

Here's what I've tried in SQL:

proc sql;
create table sql_outerjoin as 
 select * 
 from data1 full outer join data2 on data1.office_id=data2.office_id;
quit;


proc sql;
create table sql_leftjoin as 
 select * from data1 left outer join data2 on data1.office_id=data2.office_id;
quit;

proc sql;
create table work.sql_innerjoin as
select data1.*, data2.*
from work.data1, work.data2
where data1.office_id=data2.office_id;
quit; 

proc sql ;
create table sql_try1 as 
select one.*, two.*
from data1 as one
left join data2 as two
on (one.office_id = two.office_id and one.office_id_flag= two.office_id_flag);
quit;


proc sql;
create table sql_try3 as
select coalesce(a.office_id, b.office_id) as ID
from data1 a
full join data2 b
on a.ID = b.ID;
quit;

proc sql;
  create table sparcs_1 as
  select * 
  from data1, data2
  where data1.office_id=data2.office_id;
quit;


proc sql;
create table work.sql_leftjoin2 as
select s.*, d.*
from work.data1 as s left join work.data2 as d
on s.office_id=d.office_id;
quit; 

Additionally, every one of these proc sql tries has resulted in these errors:

  1. WARNING: Variable office_id already exists on file WORK.whatever_table_name.
  2. WARNING: Variable office_id_flag already exists on file WORK.whatever_table_name.

I'd like the final result to be identical to the datastep merge procedure and have all variables and 50,000 observations, matched on office_id and office_id_flag.

I'm out of my depth and am obviously shooting in the dark with this. Does anyone have any suggestions?

rlybby
  • 21
  • 1
  • Why do you have multiple observations for the same OFFICE_ID value? What are the variables that unique identify the observations in each of the input datasets? How much larger did it get? Is it possible the duplicates are just for one value of OFFICE_ID? Perhaps a missing vlaue or a code that means id unknown? – Tom Nov 17 '21 at 16:53

2 Answers2

0

If the number of observations is not increasing on data step merge, but is increasing in SQL joins, then you have data that is not unique on the merge/join key. SAS will not increase the number of rows in a merge when there are duplicate rows - it also probably won't do what you want, but it might do something tolerable. SQL, however, will produce (depending on how you do it) extra rows for each combination.

How to solve this? Make sure you have a unique join key where no pair (or more) of rows share the same value in both datasets. Or, consider the right join type for the data you have - it might require summarizing the data.

If the data actually are unique, it's possible you just didn't include the exact combination you wanted.

proc sql;
  create table sparcs_1 as
  select * 
  from data1, data2
  where data1.office_id=data2.office_id
    and data1.office_id_flag=data2.office_id_flag
  ;
quit;

or

proc sql;
  create table sparcs_1 as
  select * 
  from data1 inner join data2
  on data1.office_id=data2.office_id
    and data1.office_id_flag=data2.office_id_flag
  ;
quit;

Those both require the row to match on both variables and come from both tables. But that's not what your SAS data step merge is doing - it's only depending on office_id, so it might give a different result. Odds are you cannot replicate this join in SQL perfectly - without adding some sort of row numbering - because the way SAS does the merge is just different, and not usually something you'd actually want to do in SQL.

Second, your warnings are because you're doing things like this:

select a.*, b.* 

If the same variable is in both datasets, then you're asking for it twice. Hence the warning. It's actually not a problem if the join key is the only variable that overlaps - then the result is what you want, just with the warning - but you can remove it by explicitly listing the variables you want from at least one of the tables. Best practice is to not use * at all, but it's understandable particularly when one table has a lot of variables.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • Thanks, Joe! That explains the warning and extra observations. I've now tried this code: `proc sql ; create table sql_try4 as select data1.*, data2.er_vis, data2.adm_hr, data2.psych_hosp, data2.psych_vis, data2.region, data2.region_rpc, data2.charges from data1 left join data2 on (data1.pme_id = data2.pme_id and data1.pme_id_flag= data2.pme_id_flag); quit;` and am still seeing extra observations. How would I summarize the data? – rlybby Nov 17 '21 at 16:35
0

Thanks, Joe! Your comments were very helpful.

I ended up adding a row number based on observation row (from this previously asked question https://communities.sas.com/t5/SAS-Procedures/PROC-SQL-QUESTION-How-to-add-a-row-number-to-a-table-observation/td-p/167770), and testing the resulting dataset in both SAS and proc sql.

Code here:

proc sql;
  create table rows1 as
    select monotonic() as row, *
      from work.data1;
quit;

proc sql; 
    create table rows2 as 
        select monotonic() as row, *
            from work.data2;
quit;

proc sql ;
create table sql_rowmerge1 as 
select rows1.*, row2.er_vis, row2.adm_hr, row2.psych_hosp, row2.psych_vis, row2.region, row2.region_rpc, row2.charges
from rows1
left join rows2
on (rows1.row = rows2.row and rows1.office_id = rows2.office_id and rows1.office_id_flag= rows2.office_id_flag);
quit;

As you suggested, basing the merge on the 3 conditional variables, instead of just the 2 (probably not unique) variables, was what pulled both datasets together in the appropriate way. The resulting dataset now has the correct 50,000 observations.

Thanks again - This was great! I learned a new method!

rlybby
  • 21
  • 1
  • So why use the PROC SQL code at all if you are just going to jump through all of those hoops to make it work like the data step code? Just run the data step. Much easier, faster, clearer. – Tom Nov 17 '21 at 17:51
  • I agree - I would have rather used data step and finished it in fewer lines, but using proc sql is part of the assignment. – rlybby Nov 17 '21 at 18:25
  • "assignment"? From who? Ask them why they want you to use an inappropriate tool? Or perhaps there is something about the data you don't understand that makes using SQL code appropriate? – Tom Nov 17 '21 at 19:27