
I am looking for a way to make my R notebook-centric workflow more reproducible and, subsequently, easier to containerize with Docker. For my medium-sized data analysis projects, I work with a very simple structure: a folder with an associated .Rproj and an index.html (a landing page for GitHub Pages) that holds other folders containing the notebooks, data, scripts, etc. This simple "1 GitHub repo = 1 Rproj" structure also works well for my nb.html files rendered by GitHub Pages.

.
└── notebooks_project
    ├── notebook_1
    │   ├── notebook_1.Rmd
    │   └── ...
    ├── notebook_2
    │   ├── notebook_2.Rmd
    │   └── ...
    ├── notebooks_project.Rproj
    ├── README.md
    ├── index.html 
    └── .gitignore 

I wish to keep this workflow, which uses R notebooks both as literate programming tools and as control documents (see RMarkdown Driven Development), as it seems decently suited for medium-sized reproducible analytic projects. Unfortunately, there is little documentation about Rmd-centric workflows using renv, although renv seems to be well integrated with R Markdown.

First, Yihui Xie hinted here that the methods relevant to using renv for individual Rmd documents include renv::activate(), renv::use(), and renv::embed(). renv::activate() does only part of what renv::init() does: it loads the project and sources the init.R. From my understanding, it does this if the project was already initialized, but acts like renv::init() if it was not: it discovers dependencies, copies them to the renv global package cache, and writes several files (.Rprofile, renv/activate.R, renv/.gitignore, .Rbuildignore). renv::use() works well within standalone R scripts, where the script's dependencies are specified directly within the script and need to be installed and loaded automatically when the script is run. renv::embed() embeds a compact representation of renv.lock into a code chunk of the notebook: on render/save it changes the .Rmd by adding the code chunk with the dependencies and deleting the call to renv::embed(). As I understand it, using renv::embed() and renv::use() could be sufficient for a reproducible stand-alone notebook. Nevertheless, I don't mind having the lockfile in the directory or keeping the renv library, as long as they are all in the same directory.

Second, to prepare for subsequent Binder or Docker requirements, I would like to use renv together with RStudio Package Manager (RSPM). Grant McDermott provides some useful code here (which may go in the .Rprofile or in the .Rmd itself, I think) and gives the rationale for it:

The lockfile is referenced against RSPM as the default package repository (i.e. where to download packages from), rather than one of the usual CRAN mirrors. Among other things, this enables time-travelling across different package versions and fast installation of pre-compiled R package binaries on Linux.

Third, I'd like to use the here package to work with relative paths. It seems the way to go so that the notebooks can run when transferred or when running inside a Docker container. Unfortunately, here::here() looks for the .Rproj and will find it in my upper-level folder (i.e. notebooks_project). A .here file, which can be placed with here::set_here(), overrides this behavior and makes here::here() point to the notebook folder as intended (i.e. notebook1). However, the .here file takes effect only after restarting the R session or running unloadNamespace("here") (documented here).

Here is what I have experimented with so far:

---
title: "<br> R Notebook Template" 
subtitle: "RMarkdown Report"
author: "<br> Claudiu Papasteri"
date: "`r format(Sys.time(), '%d %m %Y')`"
output: 
    html_notebook:
            code_folding: hide
            toc: true
            toc_depth: 2
            number_sections: true
            theme: spacelab
            highlight: tango
            font-family: Arial
---

```{r setup, include = FALSE}
  
# Activate renv for the current project
renv::activate()

# Set default package source by operating system, so that we automatically pull in pre-built binary snapshots, rather than building from source.
# This can also be appended to .Rprofile 
if (Sys.info()[["sysname"]] %in% c("Linux", "Windows")) {  # For Linux and Windows use RStudio Package Manager (RSPM)
    options(repos = c(RSPM = "https://packagemanager.rstudio.com/all/latest"))
    } else {
        # For Mac users, we default to installing from CRAN/MRAN instead, since RSPM does not yet support Mac binaries.
        options(repos = c(CRAN = "https://cran.rstudio.com/"))
        # options(renv.config.mran.enabled = TRUE) ## TRUE by default
    }
options(renv.config.repos.override = getOption("repos"))

# Install (if necessary) & Load packages
packages <- c(
  "tidyverse", "here"
)
renv::install(packages, prompt = FALSE)    # install packages that are not in cache
renv::hydrate(update = FALSE)              # install any packages used in the Rnotebook but not provided, do not update  
renv::snapshot(prompt = FALSE)


# Set here to Rnotebook directory
here::set_here()
unloadNamespace("here")                   # need new R session or unload namespace for .here file to take precedence over .Rproj
rrRn_name <- fs::path_file(here::here())

# Set knitr options including root.dir pointing to the .here file in Rnotebook directory
knitr::opts_chunk$set(root.dir = here::here())

# ???
renv::use(lockfile = here::here("renv.lock"), attach = TRUE)  # automatically provision an R library when the Rnotebook is run and load packages
# renv::embed(path = here::here(rrRn_name), lockfile = here::here("renv.lock"))  # if run this embeds the renv.lock inside the Rnotebook

renv::status()$synchronized
```

I'd like my notebooks to run without code changes both locally (where dependencies are already installed and cached, and where the project was initialized) and when transferred to other systems. Each notebook should have its own renv settings.

I have many questions:

  1. What's wrong with my renv sequence? Is calling renv::activate() on every run (both for initialization and after) the way to go? Should I use renv::use() instead of renv::install() and renv::hydrate()? Is renv::embed() better for a reproducible workflow even though every notebook folder should have its renv.lock and library? renv on activation also creates an .Rproj file (e.g. notebook1.Rproj) thus breaking my simple 1 repo = 1 Rproj - should this concern me?
  2. The renv-RSPM workflow seems great, but is there any advantage of storing that script in the .Rprofile as opposed to having it within the Rmd itself?
  3. Is there a better way to use here? That unloadNamespace("here") seems hacky but it seems the only way to preserve a use for the .here files.
Claudiu Papasteri

1 Answer


What's wrong with my renv sequence? Is calling renv::activate() on every run (both for initialization and after) the way to go? Should I use renv::use() instead of renv::install() and renv::hydrate()? Is renv::embed() better for a reproducible workflow even though every notebook folder should have its renv.lock and library?

If you already have a lockfile that you want to use + associate with your projects, then I would recommend just calling renv::restore(lockfile = "/path/to/lockfile"), rather than using renv::use() or renv::embed(). Those tools are specifically for the case where you don't want to use an external lockfile; that is, you'd rather embed your document's dependencies in the document itself.

The question about renv::restore() vs renv::install() comes down to whether you want the exact package versions as encoded in the lockfile, or whatever happens to be current / latest on the R package repositories visible to your session. I think the most typical workflow is something like:

  1. Use renv::install(), renv::hydrate(), or other tools to install packages as you require them;

  2. Confirm that your document is in a good, runnable state,

  3. Call renv::snapshot() to "save" that state,

  4. Use renv::restore() in future runs of your document to "load" that previously-saved state.
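The steps above could be wired into a notebook's setup chunk roughly as follows. This is a minimal sketch rather than an official renv recipe: `provision_action()` and `provision_notebook()` are hypothetical helper names, and the rule "restore when a lockfile exists, otherwise install" is an assumption about how first runs and later runs should differ.

```r
# Hypothetical helper: decide what a notebook folder needs.
# Returns "restore" when a lockfile is present, otherwise "install".
provision_action <- function(project_dir) {
  lockfile <- file.path(project_dir, "renv.lock")
  if (file.exists(lockfile)) "restore" else "install"
}

# Hypothetical helper: carry out step 1 or step 4 for one notebook folder.
# renv is only touched when this function is called.
provision_notebook <- function(project_dir, packages) {
  action <- provision_action(project_dir)
  if (action == "restore") {
    # Step 4: reload the previously saved state (exact versions from the lockfile).
    renv::restore(project = project_dir, prompt = FALSE)
  } else {
    # Step 1: install whatever is currently available from the repositories.
    renv::install(packages, project = project_dir, prompt = FALSE)
  }
  invisible(action)
}

# Steps 2 and 3 stay manual: once the notebook runs cleanly, save the state with
# renv::snapshot(project = project_dir, prompt = FALSE)
```

With this split, `renv::snapshot()` is never called automatically, so the lockfile only changes when you decide the document is in a good state.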

renv on activation also creates an .Rproj file (e.g. notebook1.Rproj) thus breaking my simple 1 repo = 1 Rproj - should this concern me?

If this is undesired behavior, you might want to file a bug report at https://github.com/rstudio/renv/issues, with a bit more context.

The renv-RSPM workflow seems great, but is there any advantage of storing that script in the .Rprofile as opposed to having it within the Rmd itself?

It just depends on how visible you want that configuration to be. Do you want it to be active for all R sessions launched in that project directory? If so, then it might belong in the .Rprofile. Do you only want it active for that particular R Markdown document? If so, it might be worth including there. (Bundling it in the R Markdown file also makes it easier to share, since you could then share just the R Markdown document without also needing to share the project / .Rprofile)
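As an illustration of that choice, the repository setup from the question can be wrapped so that the same few lines work in either place. This is only a sketch: `choose_repos()` is a hypothetical helper, the URLs are the ones from the question, and the guard around `renv/activate.R` assumes the folder may not have been initialized yet.

```r
# Hypothetical helper: pick a package repository by operating system,
# mirroring the logic in the question.
choose_repos <- function(sysname = Sys.info()[["sysname"]]) {
  if (sysname %in% c("Linux", "Windows")) {
    c(RSPM = "https://packagemanager.rstudio.com/all/latest")
  } else {
    c(CRAN = "https://cran.rstudio.com/")
  }
}

# Source renv's activation script only if this folder already has one.
if (file.exists("renv/activate.R")) {
  source("renv/activate.R")
}

# In .Rprofile this applies to every session started in the folder;
# in the Rmd setup chunk it applies only while that document runs.
options(repos = choose_repos())
```

Placed in the setup chunk, these lines travel with the document; placed in `.Rprofile`, they also affect interactive sessions started in the notebook folder.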

Is there a better way to use here? That unloadNamespace("here") seems hacky but it seems the only way to preserve a use for the .here files.

If I understand correctly, you could just manually create a .here file yourself before loading the here package, e.g.

file.create("/path/to/.here")
library(here)

since that's all set_here() really does.
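A slightly fuller sketch of the same idea is below. A temporary folder stands in for a notebook folder (an assumption made for illustration); in a real notebook the chunk normally runs with the document's folder as working directory, so `file.create(".here")` would be enough. The names `notebook_dir` and `here_marker` are made up for the example.

```r
# A temporary folder stands in for a notebook folder such as notebook_1/.
notebook_dir <- file.path(tempdir(), "notebook_1")
dir.create(notebook_dir, recursive = TRUE, showWarnings = FALSE)

# Create the marker file only when it is missing, so repeated runs are harmless.
here_marker <- file.path(notebook_dir, ".here")
if (!file.exists(here_marker)) {
  file.create(here_marker)
}

# Move into the notebook folder BEFORE here is loaded: the package looks for
# its root when its namespace is first loaded. If here was already loaded
# earlier in the session, the root reported below will be the old one.
old_wd <- setwd(notebook_dir)
if (requireNamespace("here", quietly = TRUE)) {
  message("here root: ", here::here())
}
setwd(old_wd)
```

The ordering matters more than the helper used: the `.here` file has to exist before the first `library(here)` or `here::` call in the session.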

Kevin Ushey