TL;DR:

The common dynamic document (IPython notebook style) approach to reproducible research usually does not result in reusable source code modules. Are there tools/approaches that use the source code as the primary medium and include text within it in order to make the code more reusable?

The problem with the common dynamic document approach

I really like the concept of reproducible research using dynamic documents/notebooks. It makes sense especially for data research and analysis, where it is convenient to document and comment on the analysis process as it happens. I usually use Emacs Org-mode and/or IPython notebooks/kernels, and they are quite well integrated. I also had a look at R and its analogs (ESS, knitr).

However, usually these documents consist of a series of code blocks which are expected to be executed in sequence. When tangled (source code extraction), the resulting source code usually is not easily reusable as a module or library.

And yet I so often think "oh, I wish I could just use that specific portion of the analysis I did a few days ago". It usually turns out that I have to execute most of the cells before the interesting part because of implicit dependencies. Including only a specific part of a weaved document usually isn't easy either. And even if it were (with Org-mode's #+INCLUDE directive or LaTeX catchfilebetweentags), the individual paragraphs usually aren't self-contained. Of course, I could just copy the previous analysis document and copy, paste and adapt the relevant parts. But that rather defeats the purpose.
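To make the implicit-dependency problem concrete, here is a minimal sketch in Python. The data and the names `raw`, `clean`, `drop_missing` and `mean_of` are made up for illustration; the first half mimics what tangled notebook chunks tend to look like, the second half shows the same steps with explicit inputs.

```python
# Tangled "notebook style" code: each chunk relies on names left behind
# by the chunks before it, so none of them can be reused in isolation.
raw = [3.0, 4.5, None, 5.5]                # chunk 1: load (hypothetical data)
clean = [x for x in raw if x is not None]  # chunk 2: silently needs `raw`
mean = sum(clean) / len(clean)             # chunk 3: silently needs `clean`


# The same steps with explicit inputs: each function can be imported
# and reused on its own.
def drop_missing(values):
    """Return the values with None entries removed."""
    return [x for x in values if x is not None]


def mean_of(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)
```

With the second form, a later report can call `mean_of(drop_missing(other_data))` without re-running anything else.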

To summarize:

The common dynamic notebook approach encourages a "linear" style of code development, i.e. code chunks that are meant to be executed in sequence and text paragraphs that follow a linear narrative and thus are usually not self-contained. The result is usually tangled code and woven text documents that are both hard to reuse.

Looking for possible solutions

Here are some ideas for solving these issues that I have come up with so far.

Source code as the primary medium

After some thought I came to the conclusion that the problems described above stem from the prose/text document being the primary medium. Such a text document with occasional figures and tables is by its nature a linear description of some narrative. And I think that this is what encourages the "too-linear" style.

If source code were the primary medium, the different declarations/definitions could be modular from the start and their documentation/explanation could be self-contained. A master document could then pick just the relevant portions according to the needed narrative. In some ways this is very close to how docstrings are used in Python and extracted and processed with Sphinx. The code that generates plots, tables and values can be part of the test suite or example code.
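A minimal sketch of what I have in mind, assuming plain Python docstrings and the standard-library `doctest` module. The function `normalize` and the helper `run_examples` are hypothetical names, not part of any existing tool.

```python
"""Hypothetical analysis module: code first, explanation in docstrings."""
import doctest


def normalize(values):
    """Scale values linearly to the range [0, 1].

    The example below doubles as documentation and as a test:

    >>> normalize([2.0, 4.0, 6.0])
    [0.0, 0.5, 1.0]
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def run_examples():
    """Run the docstring example of `normalize`; return the number of failures."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        normalize.__doc__, {"normalize": normalize}, "normalize", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test).failed
```

The explanation lives next to the definition, so a documentation tool could pull out just this function and its text for a new report, while the example keeps the claim in the docstring checked.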

However, this approach limits interactivity in comparison to the common approach. Most of the interactive work would be done while creating and debugging unit tests or examples. Not that encouraging the writing of unit tests and examples is a bad thing, but it may be slower than the rapid testing/prototyping possible in e.g. IPython. On the other hand, it would be more consistent and perhaps more manageable.

"Non-linear" literate programming style using the power of noweb

Literate programming is closely related and does not force this "too-linear" approach: noweb-style references, for example, make a less "linear" and more modular style quite possible. It does not actively encourage such a style, though.

However, it usually works well only when fully tangled. Also, it is not geared towards interactive use. Additionally, prose blocks cannot be referenced like code blocks can, so the text aspect is still "linear".
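To illustrate why noweb-style references allow a non-linear layout, here is a toy expander in Python. It is only a sketch of the idea, not how noweb or Org-mode actually implement tangling; the chunk names and the `tangle` function are made up, and there is no handling of cyclic references or indentation.

```python
import re

# Hypothetical named chunks, defined in the order that suits the
# explanation rather than the order of execution.
CHUNKS = {
    "report": "<<load>>\n<<summarize>>\nprint(total)",
    "summarize": "total = sum(data)",
    "load": "data = [1, 2, 3]",
}

# Matches a noweb-style reference such as <<load>>.
REF = re.compile(r"<<([^<>]+)>>")


def tangle(name, chunks):
    """Recursively replace <<chunk>> references with the chunk bodies."""
    return REF.sub(lambda m: tangle(m.group(1), chunks), chunks[name])
```

Calling `tangle("report", CHUNKS)` yields the three statements in executable order, even though the chunks were written in a different order. This also shows the limitation mentioned above: a chunk such as `summarize` only makes sense once everything it refers to has been tangled in.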

My questions

  • Is there anyone out there using such an approach with source code being the primary medium?
  • Has anyone using such an approach successfully reused previous reports in new ones?
  • Or is the power of "non-linear" noweb the way to go?
Ondřej Grover