How to elegantly + robustly cache external script in knitr rmd document?

Question

Say, I have an external R script external.R:

df.rand <- data.frame(x = rnorm(n = 100), y = rnorm(n = 100))

Then there's a main.Rmd:

\documentclass{article}

\begin{document}

<<setup, include = FALSE>>=
library(knitr)
library(ggplot2)
# global chunk options
opts_chunk$set(cache=TRUE, autodep=TRUE, concordance=TRUE, progress=TRUE, cache.extra = tools::md5sum("external.R"))
@

<<source, include=FALSE>>=
source("external.R")
@


<<plot>>=
ggplot(data = df.rand, mapping = aes(x = x, y = y)) + geom_point()
@

\end{document}

It's helpful to have this in an external script, because in reality, it's a bunch of import, data cleaning and simulation tasks that would pollute the main.Rmd.

All chunks in main.Rmd depend on the external script, so they need to be re-run whenever it changes. To account for this dependency I added the above cache.extra = tools::md5sum("external.R").
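To see what that option evaluates to, here is a small sketch that uses a throwaway temporary file instead of the real external.R (the file and variable names are made up for illustration). It also shows that tools::md5sum() returns NA for a file that does not exist, so a mistyped path would yield a value that never changes.

```r
# Hypothetical stand-in for external.R, written to a temporary location
script <- tempfile(fileext = ".R")

writeLines("df.rand <- data.frame(x = rnorm(100), y = rnorm(100))", script)
hash_before <- unname(tools::md5sum(script))

# Any edit to the file, even an added comment, changes the hash ...
writeLines(
  c("# tweak", "df.rand <- data.frame(x = rnorm(100), y = rnorm(100))"),
  script
)
hash_after <- unname(tools::md5sum(script))
hash_before != hash_after

# ... whereas a path that does not exist yields NA on every run
missing_hash <- unname(tools::md5sum(file.path(tempdir(), "no-such-file.R")))
is.na(missing_hash)
```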

That seems to work ok.

I'm looking for best practices.

Is this robust (enough)?
Is there a more elegant way to do this? (For example, it's unfortunate that any change in external.R triggers a complete cache invalidation, rather than invalidating only those objects that actually change.)

There are no side effects (except for some library() calls, but I can move them to main.Rmd).

I'm always worried that I'm somehow doing it wrong.

What is the result of running `external.R`? Is only the object `df.rand` created or are there more objects or even side effects? — CL., Jul 10 '15 at 09:17
The result is a bunch of objects (dataframes), more than just `df.rand`. As far as I can tell, the only side effects are some `library()` calls (which I *could*/*should* move to `main.Rmd`). Incidentally, is there an R function that *tests* for side effects? — maxheld, Jul 10 '15 at 09:28

score 3 · Accepted Answer · answered Jul 10 '15 at 09:44

There should be better approaches than the do-it-yourself caching you currently use. To start with, you could split external.R into chunks:

# ---- CreateRandomDFs----
df.rand1 <- data.frame(rnorm(n = 100), rnorm(n = 100))
df.rand2 <- data.frame(rnorm(n = 100), rnorm(n = 100))

# ---- CreateOtherObjects----

# stuff

In main.Rmd, add (in an uncached chunk!) read_chunk(path = 'external.R'). Then execute the chunks:

<<CreateRandomDFs>>=
@
<<CreateOtherObjects>>=
@
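The uncached chunk that calls read_chunk() could look like the following sketch. The chunk label read-external is invented here, and the chunk is assumed to sit before the two empty chunks above so that their code is already registered when they run.

```r
<<read-external, cache=FALSE, include=FALSE>>=
# Not cached, so the labelled code in external.R is re-read on every knit
read_chunk(path = "external.R")
@
```

Because the chunk code is part of what knitr checks when it validates a cache, editing the code under one label in external.R should only invalidate the chunk with that label and the chunks that depend on it.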

If autodep doesn't work, add dependson to your chunks. A chunk that only uses df.rand1 and df.rand2 gets dependson = "CreateRandomDFs"; when other objects are also used, set dependson = c("CreateRandomDFs", "CreateOtherObjects").
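As a sketch of the two cases just described (the chunk labels summarise-dfs and summarise-all are made up, and summary() merely stands in for whatever the chunk really does):

```r
<<summarise-dfs, dependson="CreateRandomDFs">>=
# Uses only objects created in the CreateRandomDFs chunk
summary(df.rand1)
summary(df.rand2)
@

<<summarise-all, dependson=c("CreateRandomDFs", "CreateOtherObjects")>>=
# Also uses objects from CreateOtherObjects, so both chunks are listed
summary(df.rand1)
@
```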

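You can also invalidate a chunk's cache when a certain object changes by passing that object in an extra chunk option: cache.whatever = quote(df.rand1). (cache.whatever is not a built-in option; the name is arbitrary.)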
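For example, a chunk that should be re-run whenever df.rand1 changes could be written as in this sketch (the chunk label summarise-df1 is invented here; the option syntax is the one given in the sentence above):

```r
<<summarise-df1, cache.whatever=quote(df.rand1)>>=
# The value of df.rand1 becomes part of this chunk's options,
# so a change in df.rand1 should invalidate this chunk's cache
summary(df.rand1)
@
```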

This way, you avoid invalidating the whole cache with any change in external.R. How you split the code in that file into chunks is crucial: with too many chunks, you have to list many dependencies; with too few, the cache gets invalidated more often than necessary.

Would you mind explaining a bit more the last two paragraphs of your excellent answer? It seems to describe a way to exclude from cache invalidation changes to a specified object (or objects) in external.R. — lawyeR, Jul 10 '15 at 10:05
In the example, I split `external.R` into two chunks. Changes in the second chunk won't invalidate the first chunk's cache or the cache of chunks that depend on the first chunk. We could put `df.rand1` and `df.rand2` into separate chunks. What happens? If one changes, the other is unaffected. But by the same token, every chunk that depends on both `df.rand1` and `df.rand2` now needs to list two chunks as dependencies. Therefore I would group objects into the same chunk when they are likely to be required as dependencies by the same chunks. — CL., Jul 10 '15 at 10:12
