
I work with medical data and prefer to develop analyses in a package environment, taking advantage of `R CMD check`, testthat, and devtools.

A typical analysis will begin by extracting data from a database (often with lengthy joins and many rows, so not a trivial step).
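
For illustration, something like the following is the kind of extraction step I mean; it is only a sketch, and the DSN, table names, and columns are hypothetical stand-ins for the real schema:

```r
## A minimal sketch of the extraction step, assuming a DBI-compatible driver.
## The DSN, table names, and columns are hypothetical.
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "clinical_warehouse")

admissions <- dbGetQuery(con, "
  SELECT a.encounter_id, a.admit_date, d.icd9_code
  FROM   admissions a
  JOIN   diagnoses  d ON d.encounter_id = a.encounter_id
  WHERE  a.admit_date >= '2014-01-01'
")

dbDisconnect(con)
```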

My main goal is to protect health information while enabling reproducible analysis. Although I can de-identify the data, I remain concerned that it still contains a lot of potentially identifying information even when it is officially de-identified, so I treat even de-identified data very carefully. The data is about 100 to 500 MB per analysis.

Putting the data in the `data` directory of a package seems to be the worst solution: the data is large, which makes building the package slow; it is static, whereas the queries are varied and change over time; and it makes it harder to share just the code when I don't want to transmit protected data.

I have tried R.cache, memoise, and cached knitr chunks in R Markdown documents.

R.cache seems best right now, but it dumps large amounts of obscurely named data into the home directory. memoise was not flexible enough, and seemed better suited to temporary caching of calculations than to database queries. knitr caching worked well enough for R Markdown, but is unavailable for plain interactive R use.
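
One partial workaround for the home-directory problem is to point R.cache at a project-local directory and key the cache on the query text. A sketch follows; the function name, cache directory, and `get_admissions()` wrapper are hypothetical:

```r
## Sketch: redirect R.cache away from ~/.Rcache and key the cache on the SQL text.
## Function, directory, and table names are hypothetical.
library(R.cache)

# Keep cache files inside the project (excluded from version control and the
# built package via .gitignore / .Rbuildignore) rather than the default ~/.Rcache.
setCacheRootPath("./.query_cache")

get_admissions <- function(con, sql) {
  key <- list(sql)                               # cache key: the query text
  hit <- loadCache(key, dirs = "admissions")
  if (!is.null(hit)) return(hit)
  res <- DBI::dbGetQuery(con, sql)               # the slow step
  saveCache(res, key = key, dirs = "admissions")
  res
}
```

Later versions of memoise also allow an on-disk cache, e.g. `memoise(f, cache = cache_filesystem("./.query_cache"))`, which may address some of the flexibility concerns.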

Does anyone have any other recommendations or suggestions for package-based analysis with moderately large amounts of protected data?

Jack Wasey
  • I'm kinda confused. If you want to enable reproducibility, are you requiring other folks to have other data to use? If you want it reproducible with your pkg data, you kinda have to give the data. Even if you encrypt it, anyone who can decrypt it can read it. – hrbrmstr Oct 25 '15 at 16:51
  • I wouldn't share the data itself, but the code would be able to reproduce the data by doing new database queries. However, when working locally, as described above in a package development style, I want to cache my data because the database queries and some analysis steps are slow. Regardless of the protected nature of the data, it's not clear to me what is a good strategy for developing with packages and using biggish data sets and caching. – Jack Wasey Oct 25 '15 at 17:38
  • Many thanks for the clarification. You'll have to be very sure to purge all knitr cache files that are not auto-deleted, and you can encrypt/decrypt any Rdata files with https://github.com/hadley/secure (keep the key in your environment). That way you get to have your data and not be too worried about it as long as you never leak the key (or cache files). Also, ensure any session-created `.Rdata` files are deleted. I applaud your desire to treat the data responsibly. That's a rarity (I work in cybersecurity). – hrbrmstr Oct 25 '15 at 17:45
  • Thanks for your replies. The whole computer is encrypted, and I wouldn't ever send the data itself, just the code to recreate it. I think I confused two topics in my question: the data security and, more generally, working with bulk data in a package-based analysis environment. – Jack Wasey Oct 25 '15 at 18:46
  • I have kept sensitive data in separate packages before. Then you can share or upload the analysis package but keep the data on your local machine, and load the data package when you need it. For the open package, you can create a sample data set with the same structure as the sensitive data, so that all of your functions and markdown scripts can run without having the real data (sketched below). – rawr Nov 09 '15 at 23:27
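
As an illustration of that last suggestion, here is a minimal sketch of the two-package pattern: the open analysis package ships a small synthetic data set with the same structure as the protected one, while the real rows live in a private, local-only package. The package, object, and column names are hypothetical.

```r
## Synthetic stand-in with the same structure as the protected data.
## All names and columns here are hypothetical.
set.seed(1)
n <- 1000
sample_admissions <- data.frame(
  encounter_id = sprintf("E%06d", seq_len(n)),
  admit_date   = as.Date("2014-01-01") + sample(0:364, n, replace = TRUE),
  icd9_code    = sample(c("410.71", "428.0", "584.9"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# In the open package, store only the synthetic version:
#   devtools::use_data(sample_admissions)

# In analysis code, prefer the private data package when it is installed,
# otherwise fall back to the synthetic data so everything still runs:
adm <- if (requireNamespace("privateData", quietly = TRUE)) {
  privateData::admissions
} else {
  sample_admissions
}
```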

0 Answers