This is related in spirit to this question, but must be different in mechanism.

If you try to cache a knitr chunk that contains a data.table `:=` assignment, knitr acts as though that chunk has not been run, and later chunks do not see the effect of the `:=`.

Any idea why this is? How does knitr detect objects have updated, and what is data.table doing that confuses it?

It appears you can work around this by doing `DT = DT[, LHS:=RHS]`.

Example:

```{r}
library(data.table)
```
Data.Table Markdown
========================================================
Suppose we make a `data.table` in **R Markdown**
```{r, cache=TRUE}
DT = data.table(a = rnorm(10))
```
Then add a column using `:=`
```{r, cache=TRUE}
DT[, c:=5] 
```
Then we display that in a non-cached block
```{r, cache=FALSE}
DT
```
The first time you run this, the above will show a `c` column, 
from the second time onwards it will not.

Output on second run

[image: knitr output]

Corvus
  • +1 I've got no inkling about this I'm afraid. When you say the "second time onwards" do you mean a repeat of `DT`, a repeat of `DT` inside the `cache=FALSE` block, or a rerun of the script? There's nothing after "output on second run". Is that the point, i.e. it's completely blank, or did you forget to paste something there? Try inspecting the object with `.Internal(inspect(DT))` at various points. How is the `knitr` cache implemented? – Matt Dowle Mar 08 '13 at 17:11
  • @MatthewDowle -- It's a bit speculative (b/c I didn't feel like delving into **knitr**'s caching mechanism) but I suspect my answer below gets at least the big picture right. – Josh O'Brien Mar 08 '13 at 17:20
  • @JoshO'Brien Cool, sounds right to me, thanks. Will aim to come back to it and change either `knitr` or `data.table` to play nice together, but this solution is nice in the meantime. – Matt Dowle Mar 08 '13 at 17:54
  • @MatthewDowle -- Seems to me it's better fixed on the **knitr** side, and Yihui seems to agree. BTW, many thanks for making the changes needed to get **data.table** working under R-3.0.0! Was getting rid of all the non-API calls a lot of work? – Josh O'Brien Mar 08 '13 at 19:37
  • @JoshO'Brien No problem. Not really, just a few hours. Brian Ripley helped a lot by letting me know Cstack_info() existed. I would have been stuck a long time without that tip. – Matt Dowle Mar 08 '13 at 19:50

2 Answers


Speculation:

Here is what appears to be going on.

knitr quite sensibly caches objects as soon as they are created. It then updates their cached value whenever it detects that they have been altered.

data.table, though, bypasses R's normal copy-by-value assignment and replacement mechanisms, and uses the `:=` operator rather than `=`, `<<-`, or `<-`. As a result, knitr isn't picking up the signal that `DT` has been changed by `DT[, c:=5]`.
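A minimal sketch of that hypothesis, assuming (as Yihui Xie describes in the comments on this answer) that a cached chunk is evaluated in its own environment. The environment `ev` is only a stand-in for whatever knitr actually uses; `address()` is data.table's helper for reporting an object's memory address.

```r
library(data.table)

DT <- data.table(a = 1:5)
addr_before <- address(DT)

# Stand-in for the environment a cached chunk is evaluated in
# (an assumption, not knitr's actual implementation).
ev <- new.env()
eval(quote(DT[, c := 5]), envir = ev)

"c" %in% names(DT)                    # the column was added to the existing object
ls(ev)                                # no variable was created in `ev`
identical(address(DT), addr_before)   # compares the address before and after :=
```

Because `:=` changes the existing object rather than binding a new value to a name, a mechanism that tracks which variables appear in the evaluation environment has nothing to notice.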

Solution:

Just add this block to your code wherever you'd like the current value of `DT` to be re-cached. It won't cost you anything in memory or time (since nothing except a reference is copied by `DT <- DT`), but it does effectively send a (fake) signal to knitr that `DT` has been updated:

```{r, cache=TRUE, echo=FALSE}
DT <- DT 
```
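The same idea can be tried outside knitr. This is only a sketch: `ev` is a hypothetical stand-in for the environment a cached chunk is evaluated in, not knitr's real machinery.

```r
library(data.table)

DT <- data.table(a = 1:5)

ev <- new.env()   # hypothetical stand-in for the chunk's evaluation environment
evalq({
  DT[, c := 5]    # modifies the existing object; creates no variable in `ev`
  DT <- DT        # creates a variable `DT` in `ev`
}, envir = ev)

ls(ev)                       # now lists "DT"
"c" %in% names(ev$DT)        # the variable in `ev` carries the new column

# Checks whether both names point at the same object in memory
identical(address(get("DT", envir = ev)), address(DT))
```

The plain assignment gives the evaluation environment a variable of its own to report, which is presumably the signal knitr needs in order to re-cache `DT`.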

Working version of example doc:

Check that it works by running this edited version of your doc:

```{r}
library(data.table)
```
Data.Table Markdown
========================================================
Suppose we make a `data.table` in **R Markdown**
```{r, cache=TRUE}
DT = data.table(a = rnorm(10))
```

Then add a column using `:=`
```{r, cache=TRUE}
DT[, c:=5] 
```

```{r, cache=TRUE, echo=FALSE}
DT <- DT 
```

Then we display that in a non-cached block
```{r, cache=FALSE}
DT
```
The first time you run this, the above will show a `c` column. 
The second, third, and nth times, it will as well.
Josh O'Brien
  • Thanks for that - any guesses what would be involved in making `knitr` pick up on `:=`? Would it be a matter of making `data.table` give the correct signals, or making `knitr` watch out for `:=`? – Corvus Mar 08 '13 at 18:02
  • @Corone in this case it sounds like a good idea to manually assign the object names to `knitr` to cache; I can consider it if you file a feature request: https://github.com/yihui/knitr/issues – Yihui Xie Mar 08 '13 at 18:15
  • @Yihui Hi. If there's something I can change on the `data.table` side just let me know. – Matt Dowle Mar 09 '13 at 22:45
  • @MatthewDowle thanks, but it is probably hard; what I want is `x=data.table(a=1:5); ev=new.env(); eval(quote({x[, b:=5]}), envir=ev)` and `x` should also appear in the environment `ev`, but that apparently contradicts the philosophy of `data.table`. I have one possible solution in mind and I'll think more about it. – Yihui Xie Mar 10 '13 at 02:27
  • @Yihui Interesting, yes I ran that and see what you mean. Yes `data.table`'s `:=` is assignment by reference to wherever `x` is. The very purpose of `:=` is not to copy-on-write. Could the `x=data.table(a=1:5)` bit also be eval'd in the same `ev`? – Matt Dowle Mar 12 '13 at 11:28
  • @MatthewDowle yes, but there is no guarantee that `x[, :=]` is always used in the same chunk as `x <- data.table()`; for cache, the code is evaluated in a separate empty environment so I know which variables are created in a code chunk – Yihui Xie Mar 12 '13 at 13:36
  • @Yihui Ah, I see. Is there a way from inside `x[,:=]` that I can tell that `knitr` is calling me? I could change the `if (Cstack_info(...` in [this answer](http://stackoverflow.com/a/15268392/403310) to add a clause to look for a special symbol in your environment or something as well? So that when I'm called from `knitr` it's as though I'm at the prompt. – Matt Dowle Mar 12 '13 at 13:47
  • @MatthewDowle -- How hard would it be for you to add a switch that did the equivalent of `DT <- DT` or `assign(deparse(substitute(x)), x, envir=environment())` each time any `[.data.frame` operation is carried out? Then all **knitr** (or **evaluate**) would have to do is indicate that it wanted to run **data.table** with that switch turned on, and the needed copying would always take place. – Josh O'Brien Mar 12 '13 at 13:48
  • @JoshO'Brien Yes that sounds ok. An option where I can detect `knitr` is calling me sounds more future proof than `data.table` providing an option that `knitr` needs to set. – Matt Dowle Mar 12 '13 at 13:53
  • @MatthewDowle -- Agreed. (We kind of cross-posted there, and I like your idea better.) Although, if I understand correctly, such an operation (`DT <- DT`) is essentially costless, and a third possibility is to just make it the default action. Not sure if that would be stylistically offensive, and we'd definitely have to consider whether it could ever have an undesirable side-effect or break **data.table**'s behavior in any way. – Josh O'Brien Mar 12 '13 at 14:00
  • @JoshO'Brien Ah I see. There's a subtle edge case. Inside a function I often add columns to `DT` in `.GlobalEnv` by reference, several times. I don't want that to bind `DT` in the function's scope. It would be ok until all the over-allocated column slots were used up, at which point the shallow copy to create more slots would not apply to the `DT` in `.GlobalEnv`. A rare edge case. Otherwise I suppose `data.table` could always `DT<-DT`. Hadn't thought of that before. – Matt Dowle Mar 12 '13 at 14:04
  • @JoshO'Brien I've sometimes wondered if there's a way to find all symbols (in any environment) that point to a particular object. If I could do that it opens up new possibilities. I guess it must be possible, if the internal R structures are readable from a package (which they probably aren't in R3). – Matt Dowle Mar 12 '13 at 14:09
  • @MatthewDowle That sounds like a good R-devel question. Re your other question, you should be able to tell that knitr is calling you. To begin exploring, `knit()` a document with one chunk containing just this code: `sys.calls()`, and then look at all the evidence of knitr's presence that's available. Yihui will have a more incisive idea, though ;) – Josh O'Brien Mar 12 '13 at 14:17
  • @MatthewDowle -- Something like this should work: `"evaluate" %in% sapply(sys.calls(), function(X) deparse(X[[1]]))`. (Searching the call stack for a call to `evaluate` might be safer than looking for `knit`, because it'll work better in cases where the call to `knit` was constructed 'programmatically', as in `do.call(knit, ...)` or `sapply(..., knit)`, etc.) – Josh O'Brien Mar 12 '13 at 16:48
  • Filed as [#904](https://github.com/Rdatatable/data.table/issues/904) to revisit the caching issue. The printing issue though (linked question) is now properly fixed in v1.9.5. – Matt Dowle Oct 21 '14 at 20:18
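
The call-stack check suggested in the comments above can be wrapped in a small helper. This is only a sketch: `called_from()` and `fake_evaluate()` are hypothetical names, and `fake_evaluate()` stands in for the real `evaluate()` so that the example runs without knitr.

```r
# Hypothetical helper: TRUE if a function named `fn_name` is on the call stack.
called_from <- function(fn_name) {
  calls <- sys.calls()
  # Deparse the function part of each call, keeping only the first line
  callers <- vapply(calls, function(cl) deparse(cl[[1]])[1], character(1))
  fn_name %in% callers
}

# Stand-in for evaluate(); it simply calls the function it is given.
fake_evaluate <- function(f) f()

inside  <- fake_evaluate(function() called_from("fake_evaluate"))
outside <- called_from("fake_evaluate")

inside    # the check runs underneath fake_evaluate()
outside   # the check runs with no fake_evaluate() call above it
```

One limitation of exact name matching: a call written as `evaluate::evaluate(...)` deparses to `"evaluate::evaluate"`, so a real implementation would probably need a looser comparison.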

As indicated in the fourth comment under the answer by Josh O'Brien, I have added a new chunk option `cache.vars` to handle this very special case. In the second cached chunk, we can specify `cache.vars='DT'` so that knitr will save a copy of `DT`.

```{r}
library(data.table)
```
Data.Table Markdown
========================================================
Suppose we make a `data.table` in **R Markdown**
```{r, cache=TRUE}
DT = data.table(a = rnorm(10))
```
Then add a column using `:=`
```{r, cache=TRUE, cache.vars='DT'}
DT[, c:=5] 
```
Then we display that in a non-cached block
```{r, cache=FALSE}
DT
```

The output is like this no matter how many times you compile the document:

[image: knitr works with data.table now]

Yihui Xie