
The minimal reproducible example (RE) below is my attempt to figure out how I can use knitr to generate complex dynamic documents, where "complex" refers not to the document's elements and their layout, but to the non-linear logic of the underlying R code chunks. While the provided RE and its results show that a solution based on this approach might work well, I would like to know: 1) is this a correct way to use knitr in such situations; 2) are there any optimizations that would improve the approach; 3) what alternative approaches could decrease the granularity of the code chunks?

EDA source code (file "reEDA.R"):

## @knitr CleanEnv
rm(list = ls(all.names = TRUE))

## @knitr LoadPackages
library(psych)
library(ggplot2)

## @knitr PrepareData

set.seed(100) # for reproducibility
data(diamonds, package='ggplot2')  # use built-in data


## @knitr PerformEDA

generatePlot <- function (df, colName) {

  # copy the requested column into a fixed name so that aes() can refer to it
  df$var <- df[[colName]]

  g <- ggplot(data.frame(df)) +
    scale_fill_continuous("Density", low="#56B1F7", high="#132B43") +
    scale_x_log10("Diamond Price [log10]") +
    scale_y_continuous("Density") +
    geom_histogram(aes(x = var, y = ..density..,
                       fill = ..density..),
                   binwidth = 0.01)
  return (g)
}

performEDA <- function (data) {

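  # build a variable name from the argument expression, e.g. "d_diamonds",
  # and store the descriptive statistics under that name in the global environment
  d_var <- paste0("d_", deparse(substitute(data)))
  assign(d_var, describe(data), envir = .GlobalEnv)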

  for (colName in names(data)) {
    if (is.numeric(data[[colName]]) || is.factor(data[[colName]])) {
      t_var <- paste0("t_", colName)
      assign(t_var, summary(data[[colName]]), envir = .GlobalEnv)

      g_var <- paste0("g_", colName)
      assign(g_var, generatePlot(data, colName), envir = .GlobalEnv)
    }
  }
}

performEDA(diamonds)

EDA report R Markdown document (file "reEDA.Rmd"):

```{r KnitrSetup, echo=FALSE, include=FALSE}
library(knitr)
opts_knit$set(progress = TRUE, verbose = TRUE)
opts_chunk$set(
  echo = FALSE,
  include = FALSE,
  tidy = FALSE,
  warning = FALSE,
  comment=NA
)
```

```{r ReadChunksEDA, cache=FALSE}
read_chunk('reEDA.R')
```

```{r CleanEnv}
```

```{r LoadPackages}
```

```{r PrepareData}
```

Narrative: Data description

```{r PerformEDA}
```

Narrative: Intro to EDA results

Let's look at summary descriptive statistics for our dataset

```{r DescriptiveDataset, include=TRUE}
print(d_diamonds)
```

Now, let's examine each variable of interest individually.

Variable Price is ... Descriptive statistics for 'Price':

```{r DescriptivePrice, include=TRUE}
print(t_price)
```

Finally, let's examine price distribution across the dataset visually:

```{r VisualPrice, include=TRUE, fig.align='center'}
print(g_price)
```

The result can be found here:

http://rpubs.com/abrpubs/eda1

Aleksandr Blekh
  • This is a really interesting question, but I am afraid it's a bad match for stackoverflow as it is at the moment. Stackoverflow works much better for Q&A than for discussion. So, if you can phrase this more as a question that includes reproducible data and has a clear problem, I think this would be a great fit. – Andy Clifton Sep 08 '14 at 00:45
  • Agree with Andy. But in the meantime have you looked into `brew`? I haven't had a chance to try it but I hear it can do looping and more programming-like constructs in the document creation. – Aaron left Stack Overflow Sep 08 '14 at 01:39
  • @AndyClifton: Thank you for your feedback! I understand your point, but I still see from time to time similar questions getting answered here on SO. Nevertheless, I will see, if I can come up with a minimal reproducible example for this type of question. – Aleksandr Blekh Sep 08 '14 at 01:44
  • @Aaron: Thank you for your feedback and suggestion! I'd heard about `brew` when somebody recommended it to me for a different task. I looked at it very briefly, but I was able to solve that problem using a less complex approach. I feel that `knitr` is flexible and powerful (and complex) enough that I'd rather not introduce an *additional level of complexity* into my solution by adding a *templating engine* such as `brew`. Its documentation is very limited, plus I believe most of what `brew` can do can be done via `knitr`'s *hooks*. Additionally, I don't see how it solves the problem of *non-linear logic* in code. – Aleksandr Blekh Sep 08 '14 at 01:54
  • @Aaron: Despite poor documentation for `brew`, I just found this nice blog post with a complete example: http://learnr.wordpress.com/2009/09/09/brew-creating-repetitive-reports. Things are more clear to me now. It looks like `brew` is worth another look (unless @Yihui will show how to do what I want, using `knitr` functionality only). – Aleksandr Blekh Sep 08 '14 at 02:18
  • Please don't close this question - I've prepared a minimal reproducible example. Posting it within next 5-10 minutes. – Aleksandr Blekh Sep 08 '14 at 05:15
  • Is it OK to post a minimal reproducible example (RE) as a part of my answer? It works well, but I still would like people to comment on it to see, if some optimizations can be made. Also, I'd like to see other approaches, based on my RE, in order to decrease the granularity of code chunks. – Aleksandr Blekh Sep 08 '14 at 05:22
  • Re: Hold. I've provided a minimal RE and narrowed down the scope of my question (all in my answer below). Do you still think it's not enough or you'd be OK with me replacing the original question with contents of my current answer? Please let me know. – Aleksandr Blekh Sep 08 '14 at 08:08
  • Aleksandr - your question is still too broad. Adding the example in the answer doesn't really help. This example needs to be in the question. – Andy Clifton Sep 08 '14 at 15:01
  • @AndyClifton: All right. Then I will replace the original broad question with a more focused question with a reproducible example posted in my answer. – Aleksandr Blekh Sep 08 '14 at 15:10
  • [Saved this comment from deleted answer.] Just found this related question: stackoverflow.com/q/21729415/2872891. The answer suggests using `knit_expand()`, which I wasn't aware of until now. It's nice; however, I'm not sure if this approach is better than my simplistic one, mentioned above (even if used in combination with `read_chunk()`). – Aleksandr Blekh Sep 08 '14 at 15:24
  • @AndyClifton: Question reworked (moved and reworded for more focus). Answer deleted. – Aleksandr Blekh Sep 08 '14 at 15:26

1 Answer


I don't understand what's non-linear about this code; perhaps because the example (thanks for that by the way) is small enough to demonstrate the code but not large enough to demonstrate the concern.

In particular, I don't understand the reason for the performEDA function. Why not put that functionality into the markdown? It would seem to be simpler and clearer to read. (This is untested...)

Let's look at summary descriptive statistics for our dataset

```{r DescriptiveDataset, include=TRUE}
print(describe(diamonds))
```

Now, let's examine each variable of interest individually.

Variable Price is ... Descriptive statistics for 'Price':

```{r DescriptivePrice, include=TRUE}
print(summary(diamonds[["price"]]))
```

Finally, let's examine price distribution across the dataset visually:

```{r VisualPrice, include=TRUE, fig.align='center'}
print(generatePlot(diamonds, "price"))
```

It looked like you were going to show the plots for all the variables; are you perhaps looking to loop there?
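If a loop is what's wanted, a minimal sketch of a chunk body might look like this. The `varsOfInterest` vector and the inline histogram are assumptions for illustration (they are not part of the original RE), and the plot is deliberately simpler than `generatePlot()`. The point to note is that nothing is auto-printed inside a `for` loop, so both the summary and the ggplot object need an explicit `print()`.

```r
library(ggplot2)

data(diamonds, package = "ggplot2")  # same built-in data as the RE

# Hypothetical selection of variables to report on (not from the original post)
varsOfInterest <- c("price", "carat")

for (colName in varsOfInterest) {
  cat("\nDescriptive statistics for", colName, "\n")
  # Inside a loop nothing is auto-printed, so print() is required
  print(summary(diamonds[[colName]]))

  # Simplified stand-in for generatePlot(): histogram on a log10 x axis
  g <- ggplot(diamonds, aes(x = .data[[colName]])) +
    geom_histogram(bins = 50) +
    scale_x_log10(paste0(colName, " [log10]"))
  print(g)
}
```

Placed in a single chunk with `include=TRUE`, this would emit one summary and one figure per variable, at the cost of losing the hand-written narrative between them.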

Also, this wouldn't change the functionality, but it would be much more within the R idiom to have performEDA return a list with the things it had created, rather than assigning into the global environment. It took me a while to figure out what the code did as those new variables didn't seem to be defined anywhere.
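As a rough sketch of that suggestion: the version below returns a named list and leaves the global environment alone. The `plotFun` argument and the element names `summaries` and `plots` are hypothetical, and the built-in `iris` data stands in for `diamonds` so the snippet runs without extra packages.

```r
# Hypothetical rewrite of performEDA(): returns its results in a list
# rather than creating t_* and g_* variables via assign()
performEDA <- function(data, plotFun = NULL) {
  # keep the same columns the original loop handled (numeric or factor)
  keep <- vapply(data, function(x) is.numeric(x) || is.factor(x), logical(1))
  cols <- names(data)[keep]

  # one summary per column, named after the column
  summaries <- lapply(data[cols], summary)

  # plots are optional; plotFun(data, colName) is expected to return a plot object
  plots <- NULL
  if (!is.null(plotFun)) {
    plots <- lapply(cols, function(col) plotFun(data, col))
    names(plots) <- cols
  }

  list(summaries = summaries, plots = plots)
}

eda <- performEDA(iris)
print(eda$summaries$Sepal.Length)
```

In the report this would presumably be called as `eda <- performEDA(diamonds, generatePlot)`, with the chunks then printing `eda$summaries$price` and `eda$plots$price`, which makes it obvious where each object comes from.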

Aaron left Stack Overflow
  • Aaron, I appreciate your time and feedback. You're right (and I said it in my original question) that I've simplified my EDA code a lot to produce this minimal RE, maybe a bit too much. I was trying to keep the code to a minimum while still mimicking the real code. That is exactly the reason the `performEDA()` function exists in my RE. If you're curious to look at my real code for the EDA module (https://github.com/abnova/diss-floss/blob/master/analysis/eda.R), you'll understand: you will see all kinds of functions, loops and returned lists. (to be continued) – Aleksandr Blekh Sep 09 '14 at 00:27
  • (continued) Having said that, I'm now rethinking the value and feasibility of my original approach, especially having recently discovered the packages `gpairs` and, in particular, `GGally` (as I am using `ggplot2` instead of standard R graphics). I need to take another look at `GGally` to see if it's easy enough to customize for my desired EDA output. – Aleksandr Blekh Sep 09 '14 at 00:33