I can't seem to understand the following example in Advanced R.

x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))

y <- as.list(x)
cat(tracemem(y), "\n")
#> <0x7f80c5c3de20>

for (i in 1:5) {
  y[[i]] <- y[[i]] - medians[[i]]
}
#> tracemem[0x7f80c5c3de20 -> 0x7f80c48de210]: 

I don't understand why a copy would be made in this case, since "If an object has a single name bound to it, R will modify it in place" and the object referenced by y indeed has only a single name y bound to it.

J. Mini
lyh970817
  • Are you trying this in RStudio? The RStudio environment browser holds on to references to objects so that it can display them. If you try this in base R you probably won't see the issue. – MrFlick May 16 '20 at 22:49
  • If you are using RStudio, this is basically a duplicate of https://stackoverflow.com/questions/15559387/operator-in-rstudio-and-r – MrFlick May 16 '20 at 22:50
  • Please give a complete minimal example. The following produces the same memory address for both `cat`s: `x <- 1:5; y <- as.list(x); cat(tracemem(y), "\n"); for (i in 1:5) y[[i]] <- y[[i]] - 2; cat(tracemem(y), "\n")` – Xu Wang May 16 '20 at 22:54
  • @MrFlick I copied this code directly from Section 2.5.1 of Advanced R. The difference in RStudio is mentioned in the same chapter, so I assume Hadley did not run this in RStudio. – lyh970817 May 17 '20 at 06:04
  • @XuWang Thanks. I have added a complete minimal example copied from the book. – lyh970817 May 17 '20 at 06:04
  • @lyh970817 thanks, I do not know the answer to the question. But since you now gave a MWE I give a +1. Good luck! – Xu Wang May 22 '20 at 01:56

2 Answers

While the commentary regarding RStudio references is probably true, it appears as though the book is outdated.

The last commit to the source for that page was made on 2019-06-25, which predates the release of R v4.0.0 (2020-04-24).

If you check the change log for R, you will find the following change listed in v4.0.0:

Reference counting is now used instead of the NAMED mechanism for determining when objects can be safely mutated in base C code. This reduces the need for copying in some cases and should allow further optimizations in the future. It should help make the internal code easier to maintain.
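
As a quick way to tell which side of that change a given session is on, here is a minimal sketch. It assumes a default build of R: the check only compares version numbers and cannot detect a build where the mechanism was switched at compile time. The variable name `uses_refcount` is my own.

```r
# Reference counting became the default in R 4.0.0, per the NEWS entry quoted above.
# Assumption: a default build, so the version number alone decides the mechanism.
uses_refcount <- getRversion() >= "4.0.0"

cat(
  "R", format(getRversion()),
  if (uses_refcount) "(reference counting)" else "(NAMED mechanism)",
  "\n"
)
```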

R v3.6.3

Indeed, if you run the example code under R v3.6.3 (the version just prior to v4.0.0):

#> R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
#> Copyright (C) 2020 The R Foundation for Statistical Computing
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> 
#> R is free software and comes with ABSOLUTELY NO WARRANTY.
#> You are welcome to redistribute it under certain conditions.
#> Type 'license()' or 'licence()' for distribution details.
#> 
#>   Natural language support but running in an English locale
#> 
#> R is a collaborative project with many contributors.
#> Type 'contributors()' for more information and
#> 'citation()' on how to cite R or R packages in publications.
#> 
#> Type 'demo()' for some demos, 'help()' for on-line help, or
#> 'help.start()' for an HTML browser interface to help.
#> Type 'q()' to quit R.

x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))

for (i in seq_along(medians)) {
  x[[i]] <- x[[i]] - medians[[i]]
}

cat(tracemem(x), "\n")
#> <000000002457F7D0> 

for (i in 1:5) {
  x[[i]] <- x[[i]] - medians[[i]]
}
#> tracemem[0x000000002457f7d0 -> 0x0000000024697c90]: 
#> tracemem[0x0000000024697c90 -> 0x0000000024697c20]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697c20 -> 0x0000000024697bb0]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697bb0 -> 0x0000000024697b40]: 
#> tracemem[0x0000000024697b40 -> 0x0000000024697ad0]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697ad0 -> 0x0000000024697a60]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697a60 -> 0x00000000246979f0]: 
#> tracemem[0x00000000246979f0 -> 0x0000000024697980]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697980 -> 0x0000000024697910]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697910 -> 0x00000000246978a0]: 
#> tracemem[0x00000000246978a0 -> 0x0000000024697830]: [[<-.data.frame [[<- 
#> tracemem[0x0000000024697830 -> 0x00000000246977c0]: [[<-.data.frame [[<- 
#> tracemem[0x00000000246977c0 -> 0x0000000024697750]: 
#> tracemem[0x0000000024697750 -> 0x00000000246976e0]: [[<-.data.frame [[<- 
#> tracemem[0x00000000246976e0 -> 0x0000000024697670]: [[<-.data.frame [[<- 

untracemem(x)

y <- as.list(x)
cat(tracemem(y), "\n")
#> <0000000024697600> 
 
for (i in 1:5) {
  y[[i]] <- y[[i]] - medians[[i]]
}
#> tracemem[0x0000000024697600 -> 0x00000000247ec708]:

untracemem(y)

This reproduces the book's output: 15 copies for the data frame and one copy for the list.

R v4.0.0

However, if we run the same example code under R v4.0.0:

#> R version 4.0.0 (2020-04-24) -- "Arbor Day"
#> Copyright (C) 2020 The R Foundation for Statistical Computing
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> 
#> R is free software and comes with ABSOLUTELY NO WARRANTY.
#> You are welcome to redistribute it under certain conditions.
#> Type 'license()' or 'licence()' for distribution details.
#> 
#>   Natural language support but running in an English locale
#> 
#> R is a collaborative project with many contributors.
#> Type 'contributors()' for more information and
#> 'citation()' on how to cite R or R packages in publications.
#> 
#> Type 'demo()' for some demos, 'help()' for on-line help, or
#> 'help.start()' for an HTML browser interface to help.
#> Type 'q()' to quit R.

x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))

for (i in seq_along(medians)) {
  x[[i]] <- x[[i]] - medians[[i]]
}

cat(tracemem(x), "\n")
#> <00000000236B0C50> 

for (i in 1:5) {
  x[[i]] <- x[[i]] - medians[[i]]
}
#> tracemem[0x00000000236b0c50 -> 0x00000000237a7a90]: 
#> tracemem[0x00000000237a7a90 -> 0x00000000237a7a20]: [[<-.data.frame [[<- 
#> tracemem[0x00000000237a7a20 -> 0x00000000237a79b0]: 
#> tracemem[0x00000000237a79b0 -> 0x00000000237a7940]: [[<-.data.frame [[<- 
#> tracemem[0x00000000237a7940 -> 0x00000000237a78d0]: 
#> tracemem[0x00000000237a78d0 -> 0x00000000237a7860]: [[<-.data.frame [[<- 
#> tracemem[0x00000000237a7860 -> 0x00000000237a77f0]: 
#> tracemem[0x00000000237a77f0 -> 0x00000000237a7780]: [[<-.data.frame [[<- 
#> tracemem[0x00000000237a7780 -> 0x00000000237a7710]: 
#> tracemem[0x00000000237a7710 -> 0x00000000237a76a0]: [[<-.data.frame [[<- 

untracemem(x)

y <- as.list(x)
cat(tracemem(y), "\n")
#> <00000000237A7630> 

for (i in 1:5) {
  y[[i]] <- y[[i]] - medians[[i]]
}

untracemem(y)

Here the change visibly reduces the number of copies: the data frame is now copied 10 times instead of 15, and the list is no longer copied at all.
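
If you would rather count the copies than read them off the console, something along these lines should work. It is a rough sketch, not a definitive measurement: it assumes an R build with memory profiling enabled (checked via `capabilities("profmem")`) and that `tracemem()` reports go to standard output so that `capture.output()` can collect them. The helper `count_trace_lines()` is a hypothetical name of mine, and the counts you get will depend on your R version and front end.

```r
# tracemem() is only available when R was built with memory profiling.
stopifnot(capabilities("profmem"))

# Hypothetical helper: count the "tracemem[old -> new]" reports in captured output.
count_trace_lines <- function(lines) sum(startsWith(lines, "tracemem["))

x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))

# Data frame: each report line corresponds to one copy.
invisible(tracemem(x))
df_log <- capture.output(
  for (i in seq_along(medians)) x[[i]] <- x[[i]] - medians[[i]]
)
untracemem(x)

# List: the same loop on a plain list of the columns.
y <- as.list(x)
invisible(tracemem(y))
list_log <- capture.output(
  for (i in seq_along(medians)) y[[i]] <- y[[i]] - medians[[i]]
)
untracemem(y)

df_copies <- count_trace_lines(df_log)
list_copies <- count_trace_lines(list_log)
cat("data frame copies:", df_copies, "\n")
cat("list copies:      ", list_copies, "\n")
```

The loops run at top level rather than inside a helper function on purpose: passing the object to a function would add another reference to it and could itself trigger a copy.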

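To answer OP's question directly: the copy was unnecessary. It was made under the old NAMED mechanism, which tracked sharing only approximately and therefore copied whenever an object might have had more than one reference. The switch to reference counting in R v4.0.0 avoids that copy, so the object is now modified in place, as the book's rule leads you to expect.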
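
To make the rule from the question concrete ("if an object has a single name bound to it, R will modify it in place"), here is a small sketch that contrasts one name with two names. It makes the same assumptions as any `tracemem()` experiment: a build with memory profiling, and reports that `capture.output()` can collect from standard output. The variable names are mine. I would expect no report for the first case on R 4.x outside RStudio, but I have not asserted that in the code because it depends on the version and the front end.

```r
# tracemem() is only available when R was built with memory profiling.
stopifnot(capabilities("profmem"))

# Case 1: a single name bound to the vector.
a <- c(1, 2, 3)
invisible(tracemem(a))
single_log <- capture.output(a[[1]] <- 10)
untracemem(a)

# Case 2: two names bound to the same vector, so modifying one must copy.
b <- c(1, 2, 3)
b_alias <- b
invisible(tracemem(b))
shared_log <- capture.output(b[[1]] <- 10)
untracemem(b)
untracemem(b_alias)

single_copies <- sum(startsWith(single_log, "tracemem["))
shared_copies <- sum(startsWith(shared_log, "tracemem["))
cat("copies with one name: ", single_copies, "\n")
cat("copies with two names:", shared_copies, "\n")
```

After the second assignment `b_alias` still holds the original values, which is why R cannot modify the shared vector in place.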

the-mad-statter
  • I don't think that the chapter of Advanced R in question explains the NAMED mechanism. Or at the very least, it doesn't call it that. It might be more helpful to write your last paragraph in terms that the readers of Section 2.5 might understand. – J. Mini Mar 05 '21 at 18:00
  • I would describe the NAMED mechanism, but I am not sure I understand it well enough to do so succinctly. Here is a [link](https://developer.r-project.org/Refcnt.html) to additional information provided in footnote 15 of the book. – the-mad-statter Mar 06 '21 at 17:05
As pointed out in the comments, if you run the code in plain R rather than RStudio, the address does not change:

x <- data.frame(matrix(runif(5 * 1e2), ncol = 5))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)
cat(tracemem(y), "\n")
#> <0000000018A3BB80>
for (i in 1:5) {
  y[[i]] <- y[[i]] - medians[[i]]
}
cat(tracemem(y), "\n")
#> <0000000018A3BB80>
tester
  • I disagree. Running the same code as the book gives in R, I get 10 tracemem outputs. That is fewer than the book's 15, but does that mean that the book is totally wrong? It claims "In fact, each iteration copies the data frame not once, not twice, but three times!". Is it just that things aren't as bad as the book says, or is the book outright wrong? – J. Mini Feb 27 '21 at 17:09