6

I am in the midst of re-writing a SAS program I normally use into R. Many statements in the SAS program are written with proc sql, so for ease of programming I am replicating those statements with sqldf in R. All of the sql code works the same with sqldf as it does in SAS, except for this statement:

SAS:

proc sql;
create table merged32 as
select max(ctf) as ctf,
       max(ctms) as ctms,
       max(site_name) as site_name,
       provider_id, npi,
       first_name, middle_name, last_name, specialty,
       address_1, city, state, site_ct, zip, site_key
    from merged2
    group by 9,10,11,12,14,15;
quit;
run;

sqldf:

sqldf("select max(ctf) as ctf,
              max(ctms) as ctms,
              max(site_name) as site_name,
              provider_id, npi,
              first_name, middle_name, last_name, specialty,
              address_1, city, state, site_ct, zip, site_key
       from merged2
       group by 9,10,11,12,14,15")

In SAS, it returns a dataset with 1878 rows; in sqldf, a dataframe with 1375.

Are there any major differences between proc sql and sqldf that could be causing this, or in general that need to be considered? I didn't provide data because 1) the datasets are huge, and 2) I'm more just interested in knowing the differences between the two sql systems.

Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
user1445246
  • 303
  • 4
  • 14
  • 1
    Not sure about sqldf, but for meaningful results, you should group by *all* your key variables in SQL. – stevepastelan Aug 02 '12 at 19:19
  • 2
    I never knew why I tend to put the "key" variables first, and the aggregates last. now I know... – wildplasser Aug 02 '12 at 19:43
  • shoot, didn't even notice the missing key variables in the list. Thanks guys – user1445246 Aug 02 '12 at 19:51
  • 1
    @user1445246, has this question been answered through the inputs in the comments? If so, please feel free to add your solution as an answer and mark it as accepted to help remove it from the "unanswered question" queue. – A5C1D2H2I1M1N2O1R2T1 Oct 30 '12 at 09:16

2 Answers2

3

Not sure about sqldf, but for meaningful results, you should group by all your key variables in SQL.

stevepastelan
  • 1,264
  • 7
  • 11
0

The results in SAS maybe have duplicated records, while those in R do not

user1582755
  • 213
  • 1
  • 2
  • 9