I work with medical data and prefer to develop analyses in a package environment, taking advantage of `R CMD check`, `testthat`, and `devtools`.
A typical analysis will begin by extracting data from a database (often with lengthy joins and many rows, so not a trivial step).
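For illustration, the extraction step looks roughly like this (a sketch only: the DSN, table names, and join are hypothetical):

```r
library(DBI)

# Connect to the clinical database (the odbc DSN is a made-up example)
con <- dbConnect(odbc::odbc(), dsn = "clinical_warehouse")

# A lengthy multi-table join that can return hundreds of thousands of rows
sql <- "
  SELECT p.patient_id, e.admit_date, d.icd_code
  FROM patients p
  JOIN encounters e ON e.patient_id = p.patient_id
  JOIN diagnoses  d ON d.encounter_id = e.encounter_id
"
extract <- dbGetQuery(con, sql)
```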
My main goal is to protect health information while enabling reproducible analysis. Although I can de-identify the data, I remain concerned that it still carries a lot of potentially identifying information even once it is officially de-identified, so I treat even de-identified data very carefully. The data is about 100 to 500 MB per analysis.
Putting data in the `data` directory of a package seems to be the worst solution: the data is large, which makes package building slow; it is static, whereas the queries are varied and change over time; and it makes it harder to share just the code when I don't want to transmit protected data.
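For concreteness, the approach I'm rejecting amounts to freezing one query result into the package (reusing `extract` from the sketch above):

```r
# Snapshot a single extract into data/: the file is large, goes stale
# as the queries evolve, and travels with the package even when I only
# want to share code
save(extract, file = "data/extract.rda", compress = "xz")
```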
I have tried `R.cache`, `memoise`, and cached `knitr` blocks in R Markdown documents.
`R.cache` seems best right now, but it dumps large amounts of obscurely named data in the home directory. `memoise` was not really flexible enough, and seemed much better suited to temporary caching of calculations than to database queries. `knitr` caching worked okay in R Markdown, but is unavailable for plain interactive R use.
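My `R.cache` pattern is roughly the sketch below (again reusing `con` and `sql` from above). By default it writes under `~/.Rcache`, which is where the obscurely named files accumulate; `setCacheRootPath()` can at least point them elsewhere:

```r
library(R.cache)

# Redirect the cache away from the home directory (path is hypothetical)
setCacheRootPath("/secure/project/.Rcache")

key <- list(db = "clinical_warehouse", query = sql)
extract <- loadCache(key)
if (is.null(extract)) {
  extract <- dbGetQuery(con, sql)  # the slow step
  saveCache(extract, key = key)
}
```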
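With `memoise`, the disk-backed variant looked something like this; results persist between sessions, but invalidation is driven purely by the argument hash, which is why it suits repeated calculations better than queries whose underlying tables change:

```r
library(memoise)

# Disk cache so memoised results survive across R sessions (path hypothetical)
fs_cache <- cache_filesystem("/secure/project/memoise")

get_extract <- memoise(
  function(sql) dbGetQuery(con, sql),  # re-runs only when `sql` changes
  cache = fs_cache
)

extract <- get_extract(sql)
```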
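And the `knitr` version, which works inside an R Markdown document but has no equivalent at the interactive prompt:

````
```{r extract, cache=TRUE}
# Re-evaluated only when the chunk's code changes; results stored on disk
extract <- dbGetQuery(con, sql)
```
````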
Does anyone have any other recommendations or suggestions for package-based analysis with moderately large amounts of protected data?