
Say, I have an external R script external.R:

df.rand <- data.frame(x = rnorm(n = 100), y = rnorm(n = 100))

Then there's a main.Rmd:

\documentclass{article}

\begin{document}

<<setup, include = FALSE>>=
library(knitr)
library(ggplot2)
# global chunk options
opts_chunk$set(cache=TRUE, autodep=TRUE, concordance=TRUE, progress=TRUE, cache.extra = tools::md5sum("external.R"))
@

<<source, include=FALSE>>=
source("external.R")
@


<<plot>>=
ggplot(data = df.rand, mapping = aes(x = x, y = y)) + geom_point()
@

\end{document}

It's helpful to have this in an external script, because in reality, it's a bunch of import, data cleaning and simulation tasks that would pollute the main.Rmd.

Every chunk in main.Rmd depends on the external script. To account for this dependency, I added cache.extra = tools::md5sum("external.R") to the global chunk options above.
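As a minimal sketch of why a checksum can serve as a cache key: any edit to the file changes the value returned by tools::md5sum(), while a path that does not exist yields NA instead of an error, so a misspelled file name would go unnoticed. The file written here is a hypothetical stand-in for external.R in a temporary directory, not the real script.

```r
# Hypothetical stand-in for external.R, written to a temporary directory.
script <- file.path(tempdir(), "external.R")
writeLines("df.rand <- data.frame(x = rnorm(100), y = rnorm(100))", script)
before <- tools::md5sum(script)

# Any edit to the file, even an added comment, changes the checksum.
cat("# a harmless comment\n", file = script, append = TRUE)
after <- tools::md5sum(script)
changed <- unname(before) != unname(after)

# A path that does not exist gives NA rather than an error, so a typo in
# the file name would silently produce a constant cache key.
missing_sum <- tools::md5sum(file.path(tempdir(), "no-such-file.R"))

print(changed)
print(is.na(missing_sum))
```

Because the checksum covers the whole file, this key cannot tell a cosmetic edit from one that changes the data.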

That seems to work ok.

I'm looking for best practices.

  • Is this robust (enough)?
  • Is there a more elegant way to do this? (For example, it's unfortunate that any change in external.R triggers a complete cache invalidation, rather than invalidating only those objects that actually change.)

There are no side effects (except for some library() calls, but I can move them to main.Rmd).

I'm always worried that I'm somehow doing it wrong.

user227710
maxheld
  • What is the result of running `external.R`? Is only the object `df.rand` created or are there more objects or even side effects? – CL. Jul 10 '15 at 09:17
  • The result is a bunch of objects (dataframes), more than just `df.rand`. As far as I can tell, the only side effects are some `library()` calls (which I *could*/*should* move to `main.Rmd`). Incidentally, is there an R function that *tests* for side effects? – maxheld Jul 10 '15 at 09:28

1 Answer


There should be better approaches than the do-it-yourself caching you currently use. To start with, you could split external.R into chunks:

# ---- CreateRandomDFs----
df.rand1 <- data.frame(rnorm(n = 100), rnorm(n = 100))
df.rand2 <- data.frame(rnorm(n = 100), rnorm(n = 100))

# ---- CreateOtherObjects----

# stuff

In main.Rmd, add (in an uncached chunk!) read_chunk(path = 'external.R'). Then execute the chunks:

<<CreateRandomDFs>>=
@
<<CreateOtherObjects>>=
@
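To see which labels read_chunk() picks up, the script can be read outside of a document and the registered code inspected. This is only a sketch: the file is a hypothetical stand-in written to a temporary directory, and it assumes a knitr version that exports the knit_code object.

```r
library(knitr)

# Hypothetical stand-in for external.R, written to a temporary directory.
script <- file.path(tempdir(), "external.R")
writeLines(c(
  "# ---- CreateRandomDFs ----",
  "df.rand1 <- data.frame(x = rnorm(100), y = rnorm(100))",
  "df.rand2 <- data.frame(x = rnorm(100), y = rnorm(100))",
  "",
  "# ---- CreateOtherObjects ----",
  "other <- seq_len(10)"
), script)

# Register the labelled code; nothing is evaluated at this point.
read_chunk(path = script)

# Assumption: knit_code is exported by the installed knitr version.
chunks <- knit_code$get()
labels <- names(chunks)
print(labels)
cat(chunks[["CreateRandomDFs"]], sep = "\n")

# Clear the registered code again so it does not leak into a later knit() run.
knit_code$restore()
```

An empty chunk in the document whose label matches one of these names is then filled with the corresponding code when the document is knitted.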

If autodep doesn't work, add dependson to your chunks. A chunk that only uses df.rand1 and df.rand2 gets dependson = "CreateRandomDFs"; when other objects are also used, set dependson = c("CreateRandomDFs", "CreateOtherObjects").
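The pieces above can be tried end to end in a scratch directory. The following is a sketch under stated assumptions, not the asker's actual setup: the file contents, the chunk label total, and the object other are made up for illustration, and the document is named main.Rnw because it uses Sweave-style chunk syntax. It knits twice and checks that the second run reuses the cached random data instead of drawing new numbers.

```r
library(knitr)

# Work in a scratch directory so the demo does not touch a real project.
work_dir <- tempfile("knitr-demo-")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# Hypothetical external script, split into labelled chunks.
writeLines(c(
  "# ---- CreateRandomDFs ----",
  "df.rand1 <- data.frame(x = rnorm(100), y = rnorm(100))",
  "",
  "# ---- CreateOtherObjects ----",
  "other <- seq_len(10)"
), "external.R")

# Hypothetical main document: read_chunk() sits in an uncached chunk, and
# the chunk 'total' declares its dependency explicitly via dependson.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<setup, include = FALSE, cache = FALSE>>=",
  "library(knitr)",
  "opts_chunk$set(cache = TRUE)",
  "read_chunk(path = 'external.R')",
  "@",
  "<<CreateRandomDFs>>=",
  "@",
  "<<CreateOtherObjects>>=",
  "@",
  "<<total, dependson = 'CreateRandomDFs'>>=",
  "sum(df.rand1$x)",
  "@",
  "\\end{document}"
), "main.Rnw")

env1 <- new.env()
env2 <- new.env()
knit("main.Rnw", envir = env1, quiet = TRUE)  # first run fills the cache
knit("main.Rnw", envir = env2, quiet = TRUE)  # second run loads from it

# Compare while still inside work_dir, because cached objects are loaded lazily.
same_data <- identical(env1$df.rand1, env2$df.rand1)
tex_written <- file.exists("main.tex")
setwd(old_wd)

print(same_data)
```

Editing only the code under CreateOtherObjects in external.R should leave the cache of CreateRandomDFs and total untouched, which is the point of splitting the script.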

You may also invalidate a chunk's cache when a certain object changes: cache.whatever = quote(df.rand1).

This way, a change in external.R no longer invalidates the whole cache. How you split the code in that file into chunks is crucial: with too many chunks, you will have to list many dependencies; with too few, the cache gets invalidated more often than necessary.

CL.
  • Would you mind explaining a bit more the last two paragraphs of your excellent answer? It seems to describe a way to exclude from cache invalidation changes to a specified object (or objects) in external.R. – lawyeR Jul 10 '15 at 10:05
  • In the example, I split `external.R` into two chunks. Changes in the second chunk won't invalidate the first chunk's cache or the cache of chunks that depend on the first chunk. We could instead put `df.rand1` and `df.rand2` into separate chunks; then, if one changes, the other is unaffected. But by the same token, every chunk that depends on both `df.rand1` and `df.rand2` would need to list two chunks as dependencies. Therefore, I would group objects into the same chunk when they are likely to be required as dependencies by the same chunks. – CL. Jul 10 '15 at 10:12