3

If one is building a substantial, organization-wide code base in R, is it acceptable practice to rely on the sqldf package as the default approach for data munging tasks? Or is best practice to rely on operations with R specific syntax where possible? By relying on sqldf, one is introducing a substantial amount of a different syntax, SQL, into their R code base.

I'm asking this question with specific regard to maintainability and style. I've searched existing R style guides and did not find anything on this subject.

EDIT: To clarify the workflow I'm concerned with, consider a data munging script making ample use of sqldf as follows:

library(sqldf)
gclust_group<-sqldf("SELECT clust,SUM(trips) AS trips2
                FROM gclust
                GROUP BY clust")

gclust_group2<-sqldf("SELECT g.*, h.Longitude,h.Latitude,h.withinss, s.trips2
                 FROM highestd g
                 LEFT JOIN centers h
                 ON g.clust=h.clust
                 LEFT JOIN gclust_group s
                 ON g.clust=s.clust")

And such a script could continue for many lines. (For those familiar with Hadoop and PIG, the style is actually similar to a PIG script). Most of the work is done using SQL syntax, albeit with the benefit of avoiding complex subqueries.

tmthydvnprt
  • 10,398
  • 8
  • 52
  • 72
Ben Rollert
  • 1,564
  • 1
  • 13
  • 21
  • Does efficiency is a concern? If yes is not a good choice and by far is not the easier to maitain( thtought you store "sql" code separately and call it using `readLines`). – agstudy Feb 08 '14 at 05:33
  • If you have and sql background maybe you can investigate `data.table` package which has more decent maintaible workflow and has some analogy with sql.(read the faq 2.16)[[http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf]. – agstudy Feb 08 '14 at 05:35
  • 1
    I would have thought for maintainability, `dplyr` is preferable as it uses SQL or other external stores and imposes the same syntax for all, with similar operations for internal tables too eg `data.frame`, `data.table`. – Stephen Henderson Feb 08 '14 at 08:39
  • Spacedman's answer makes a lot of sense. If you write clear functions, the package referenced in the function internals is not important; rather, the focus is on efficiency for the task at hand. – Ben Rollert Feb 08 '14 at 13:45
  • It depends on the background of the maintainers. If they know SQL and not much R then its a lot easier to maintain SQL. If they know R well R may be preferrable. If the data is persistent then using a database seems important but if its temporary then it is not an issue. Whether to use sqldf or driver packages (RSQLite, etc.) directly depends on what you are doing. Its not true that sqldf is overall slower than R although there may be individual instances (mostly small examples where the speed does not matter anyways). Some users use sqldf to get a speed advantage over R. – G. Grothendieck Feb 08 '14 at 14:00

1 Answers1

3

Write functions. Functions with clear names that describe their purpose. Document them. Write tests.

Whether the functions contain sqldf parts, or use dplyr, or use bare R code, or call Rcpp is at that level irrelevant.

But if you want to try changing something from sqldf to dplyr the important thing is that you have a stable platform on which to experiment, which means well-defined functions and a good set of tests. Maybe there's a bottleneck in one function that might run 100x faster if you do it with dplyr? Great, you can profile and test the code with both.

You can even branch your code and have a sqldf branch and a dplyr branch in your revision control system (you are using an RCS, right?) and work in parallel until you get a winner.

It honestly doesn't matter if you are introducing other bits of syntax into your R code from a maintainability perspective if your codebase is well-documented and tested.

Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • Yes, I use Git for RCS. Understood that other bits of syntax inside function internals is not particularly important. – Ben Rollert Feb 08 '14 at 13:33