
I've finally decided to make my dissertation research as reproducible as it can be, given my circumstances. Since I don't currently use LaTeX for my dissertation report (though I'm considering that option), I believe that knitr is the best way to go.

The software project implementing the empirical part of my dissertation research (data analysis) is being written in R. The project contains multiple files within a directory structure that is rather typical for scientific workflows (top-level sub-directories: analysis, cache, data, figures, import, prepare, present, results, sandbox, utils).

I have read a lot of information (including examples) on using knitr for auto-generating reports and for reproducible research in general. However, I'm somewhat overwhelmed by the multitude of configuration options and, more importantly, still confused about the best/correct/optimal approach for using knitr in projects like mine, which contain multiple files and directories. In particular, I'm interested in advice on a framework and steps for transitioning an existing codebase without too many modifications to the R modules.

As an example, let's consider my modules, related to exploratory data analysis (EDA). My current EDA workflow includes:

  • preliminary data, transformed from the original raw data (located in "data/transform" sub-directories);

  • module "eda.R", located in "analysis" directory;

  • directory "results/eda", where my current code generates figures (SVG files) for univariate and multivariate EDA, as well as a single report document (PDF file) containing the same graphical information only (descriptive statistics are produced as console output when running the "eda.R" script).

In order to transition to a knitr-based project, I have created a file "eda-report.Rmd" with R Markdown statements for setting local knitr options, including read_chunk("eda.R"). My understanding is that I now need to define the existing blocks of R code in "eda.R" as knitr chunks and then call these named chunks according to my EDA workflow.

Questions:

Is this the correct approach? What are the best practices for using knitr with regard to setting up project paths, using source(), grouping some plots via gridExtra, and preventing potential issues? It seems to me that, in addition to "eda-report.Rmd", I need to create another R module that will initiate processing of the .Rmd file by knitr. If yes, which call should I use: rmarkdown::render() or knitr::knit() (while I use RStudio for development, I want my code to be independent of the development environment)?

UPDATE 1 (Additional question):

Why does processing an .Rmd file in RStudio via the "Knit HTML" button produce an HTML document, while processing it via the Makefile command Rscript -e 'library("knitr"); knit("eda-report.Rmd")' produces an .md file rather than HTML, despite the presence of the output: html_document directive?

Thank you for reading this! Your advice will be greatly appreciated!

Aleksandr Blekh
  • 1
    Regarding your UPDATE 1: what the `Knit HTML` button does is `rmarkdown::render()` instead of `knitr::knit(); the former calls the latter as the first step, then processes the .md output with Pandoc.` – Yihui Xie Jul 20 '14 at 15:12
  • @Yihui: Thank you! So, then, is the correct approach for using `knitr` via command line (in `Makefile`) and applying the `output` directive to use the following: `Rscript -e 'library("rmarkdown"); rmarkdown::render("eda-report.Rmd")'`? – Aleksandr Blekh Jul 20 '14 at 22:01
  • 1
    There is no _correct_ approach. It completely depends on whether you want to use Pandoc to process the *.md output from knitr. – Yihui Xie Jul 20 '14 at 22:46
  • @Yihui: I'm a little confused... I'd like to process `.Rmd` files, so that final resulting files would be generated, based on the `output` directives. What is the process for such conversion? – Aleksandr Blekh Jul 20 '14 at 23:05
  • 1
    In that case, the answer is `rmarkdown::render()`. knitr only runs the code chunks, and generate a markdown output document; the rest of the work is handled over to Pandoc via rmarkdown using the `output` directive. – Yihui Xie Jul 21 '14 at 03:24
  • @Yihui: Got it! Based on your previous comment, what packages should I be loading in my `Rscript` command mentioned above: `rmarkdown`, `knitr` or both? Thank you! – Aleksandr Blekh Jul 21 '14 at 04:45
  • 1
    You only need `rmarkdown::render()`. Loading knitr or not depends on whether you want to use any objects in knitr; see the section "The knitr package" on this page: http://rmarkdown.rstudio.com/authoring_migrating_from_v1.html – Yihui Xie Jul 21 '14 at 16:05
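The difference described in these comments can be sketched as follows. This is only an illustration, assuming the knitr and rmarkdown packages and Pandoc are installed; the file `demo-report.Rmd` and its content are hypothetical stand-ins written to a temporary directory, not the asker's "eda-report.Rmd".

```r
# Create a tiny throwaway R Markdown file (hypothetical content) in a temp dir.
work_dir <- tempfile("knit-vs-render-")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "demo-report.Rmd")

fence <- strrep("`", 3)  # three backticks, built here to keep this listing simple
writeLines(c(
  "---",
  "title: \"Demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_file)

# knitr::knit() only runs the chunks and writes Markdown; the YAML `output`
# field is not acted upon at this stage.
md_file <- knitr::knit(rmd_file,
                       output = file.path(work_dir, "demo-report.md"),
                       quiet = TRUE)

# rmarkdown::render() knits first, then passes the Markdown to Pandoc,
# which produces the format named in `output: html_document`.
html_file <- rmarkdown::render(rmd_file, quiet = TRUE)

basename(md_file)
basename(html_file)
```

From a Makefile, the equivalent of the second call is the command already quoted in the comments: `Rscript -e 'rmarkdown::render("eda-report.Rmd")'`.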

1 Answer


In order to transition your workflow to using knitr, I suggest that rather than trying to make every last piece of code you write reproducible, you should start with the bits that will be most useful.

Since knitr is a report generation tool, the best place to start is by writing your dissertation in knitr. (You mention that you don't use LaTeX at the moment. That's fine: knitr also supports AsciiDoc, which I find easier to write. If your dissertation doesn't have many equations or tables, you might also get away with writing it in Markdown or Textile, which are even easier.)

Similarly, knitr is good for any reports or papers that you might write.

For more advanced usage, you can create presentations using knitr. (I sometimes knit xhtml Slidy presentations.)

What I wouldn't bother with is trying to knit all your exploratory data analysis. Most things you'll find are boring or dead ends, so it isn't worth the extra effort. Concentrate on exploring as fast as you can, then knit the interesting bits afterwards. Likewise, data cleaning isn't usually that interesting, so well commented code often suffices.


To answer your question about directory structure, my preference is that since knitr reports are for final output, they should be sandboxed away from scrappier exploratory work. That is, they can have their own directory, and produce their own copies of figures.
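One possible way to sketch this sandboxing, assuming the rmarkdown package and Pandoc are installed. The layout (`reports/eda`, `data/cars.csv`) and the file content are hypothetical and are created in a temporary directory purely for illustration; the report lives in its own directory, reads shared data by relative path, and keeps its own figure files next to its output.

```r
# Hypothetical project layout, created in a temp dir for illustration only.
project_dir <- tempfile("project-")
report_dir  <- file.path(project_dir, "reports", "eda")  # assumed directory name
dir.create(file.path(project_dir, "data"), recursive = TRUE)
dir.create(report_dir, recursive = TRUE)

# Shared input data lives outside the report directory.
write.csv(cars, file.path(project_dir, "data", "cars.csv"), row.names = FALSE)

fence <- strrep("`", 3)  # three backticks
rmd_file <- file.path(report_dir, "eda-report.Rmd")
writeLines(c(
  "EDA report (sketch)",
  "",
  paste0(fence, "{r scatter}"),
  "# Chunks run in the report's own directory by default,",
  "# so shared data is reached with a relative path.",
  "d <- read.csv('../../data/cars.csv')",
  "plot(d$speed, d$dist)",
  fence
), rmd_file)

# With self_contained = FALSE the figures are kept as separate files in a
# directory next to the HTML output, i.e. inside the report's sandbox.
out <- rmarkdown::render(
  rmd_file,
  output_format = rmarkdown::html_document(self_contained = FALSE),
  quiet = TRUE
)

figure_files <- list.files(report_dir, pattern = "\\.png$", recursive = TRUE)
```

Nothing is written to the exploratory directories; deleting `reports/eda` removes the report and its figures without touching the rest of the project.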

Richie Cotton
  • Appreciate your fast and comprehensive reply! Generally, my approach to transitioning to `knitr` is surprisingly close to your recommendations. Ideally, I'd prefer to convert most major dissertation artifacts, such as the final dissertation report, defense presentation slides and maybe some intermediate reports, to being auto-generated by `knitr`. However, given an extremely tight schedule and the fact that the major work (core analysis using `SEM`) is yet to be done, I have to prioritize my tasks. If I have time to make my research more reproducible, I'll gladly do so. To be continued... – Aleksandr Blekh Jul 20 '14 at 10:20
  • I'm a bit surprised by your advice not to use `knitr` for `EDA`, especially because using it seems to make sense based on the rest of your advice. Actually, the two main reasons I decided to start my experience with `knitr` by generating an `EDA` report are compatible with your overall advice: 1) `EDA` reports seem to be a **natural fit** for RR in general and knitr in particular; 2) EDA reports are **limited in scope**, thus allowing incremental progress, as well as configuring, producing, debugging and storing such reports without much disruption to the rest of the project. – Aleksandr Blekh Jul 20 '14 at 10:34
  • 1
    OK, maybe I came across as too negative on knitting your explorations. I've expanded the comment for more clarity & balance. – Richie Cotton Jul 21 '14 at 12:49
  • 1
    This is necessarily biased by personal preference, but I find that keeping the R chunks in external R files (http://yihui.name/knitr/demo/externalization/) is the only way I can keep some flexibility with reproducible documents that change over time. The Rmd/Rnw source only contains references to code chunks, so I can refine and iterate over the code independently of the writing. – baptiste Jul 21 '14 at 14:31
  • @baptiste: Just discovered your comment. I agree with you. Actually, I figured it out earlier and I'm already doing it (or should I say "trying"). Would you share how you deal with non-linear logic in external R files, in terms of referencing code chunks via `knitr`? Please see my most recent relevant SO question: http://stackoverflow.com/q/25715609/2872891. – Aleksandr Blekh Sep 08 '14 at 02:12
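The externalization approach mentioned in the question and in baptiste's comment can be sketched roughly as follows, assuming the knitr package is installed. The file names "eda.R" and "eda-report.Rmd" come from the question, but their content here is hypothetical and written to a temporary directory. The R script marks chunks with `## ---- label ----` comments, and the .Rmd file contains only empty chunks that refer to those labels.

```r
work_dir <- tempfile("externalize-")
dir.create(work_dir)

# External R script (hypothetical content): each chunk starts with a
# `## ---- label ----` marker that knitr::read_chunk() recognizes.
writeLines(c(
  "## ---- load-data ----",
  "d <- cars",
  "",
  "## ---- describe ----",
  "summary(d)"
), file.path(work_dir, "eda.R"))

# The report holds prose plus empty chunks named after the labels above;
# knitr fills each empty chunk with the code registered by read_chunk().
fence <- strrep("`", 3)  # three backticks
writeLines(c(
  paste0(fence, "{r setup, include=FALSE}"),
  "knitr::read_chunk('eda.R')",
  fence,
  "",
  "Descriptive statistics:",
  "",
  paste0(fence, "{r load-data}"),
  fence,
  "",
  paste0(fence, "{r describe}"),
  fence
), file.path(work_dir, "eda-report.Rmd"))

md_file <- knitr::knit(file.path(work_dir, "eda-report.Rmd"),
                       output = file.path(work_dir, "eda-report.md"),
                       quiet = TRUE)
md_text <- readLines(md_file)
```

Because "eda.R" stays a plain R script, it can still be run or sourced on its own; the chunk markers are ordinary comments.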