I am trying to join two datasets based on variables office_id and office_id_flag, each with 50,000 observations.
data1 variables
- grp
- age
- hispanic
- id
- race
- sex
- mental_ill
- mental_ill_dx
- office_id
- office_id_flag
data2 variables
- er_vis
- adm_hr
- psych_hosp
- psych_vis
- region
- region_rpc
- charges
- office_id
- office_id_flag
I first tried this in data step merge procedure because I know how that works, and the resulting dataset had 50,000 observations and 17 variables, which makes sense to me. Merge code:
proc sort data=data1; by office_id; run;
proc sort data=data2; by office_id; run;
data work.merge_datastep;
merge data1 (in=dem) data2 (in=hosp);
by office_id;
if dem and hosp;
run;
*53000 observations and 17 variables;
I've tried so many SQL merge types and each one produces a dataset with 56000 observations and 17 variables, which doesn't make sense.
Here's what I've tried in SQL:
proc sql;
create table sql_outerjoin as
select *
from data1 full outer join data2 on data1.office_id=data2.office_id;
quit;
proc sql;
create table sql_leftjoin as
select * from data1 left outer join data2 on data1.office_id=data2.office_id;
quit;
proc sql;
create table work.sql_innerjoin as
select data1.*, data2.*
from work.data1, work.data2
where data1.office_id=data2.office_id;
quit;
proc sql ;
create table sql_try1 as
select one.*, two.*
from data1 as one
left join data2 as two
on (one.office_id = two.office_id and one.office_id_flag= two.office_id_flag);
quit;
proc sql;
create table sql_try3 as
select coalesce(a.office_id, b.office_id) as ID
from data1 a
full join data2 b
on a.ID = b.ID;
quit;
proc sql;
create table sparcs_1 as
select *
from data1, data2
where data1.office_id=data2.office_id;
quit;
proc sql;
create table work.sql_leftjoin2 as
select s.*, d.*
from work.data1 as s left join work.data2 as d
on s.office_id=d.office_id;
quit;
Additionally, every one of these proc sql tries has resulted in these errors:
- WARNING: Variable office_id already exists on file WORK.whatever_table_name.
- WARNING: Variable office_id_flag already exists on file WORK.whatever_table_name.
I'd like the final result to be identical to the datastep merge procedure and have all variables and 50,000 observations, matched on office_id and office_id_flag.
I'm out of my depth and am obviously shooting in the dark with this. Does anyone have any suggestions?