Clarification of copying array semantics in R on assignment to array

Question

Here is some code exploring the additional copying that can result from assigning to a cell in an array (in this case using a for loop).

# populate a vector with a million random numbers
n = 10^6
v=runif(n)
# vectorized version: fast
vv<-v*v;
m<-mean(vv); m
# for loop: slow
tracemem(vv)
for(i in 1:length(v)) { vv[i]<-v[i]*v[i] };
m<-mean(vv); m

outputs

> vv<-v*v;
> m<-mean(vv); m
[1] 0.3329162
> # for loop: slow
> tracemem(vv)
[1] "<0x000007ffff560010"
> for(i in 1:length(v)) { vv[i]<-v[i]*v[i] };
tracemem[0x000007ffff560010 -> 0x000007fffe570010]: 
> m<-mean(vv); m
[1] 0.3329162

which seems to indicate that there is a copy of the vector on the very first iteration of the loop.

Note: this is a follow-up to my earlier question "Why is vectorization faster", to an answer to it, and to a comment on that answer.

Just to confirm the copying, I did the first iteration outside of the loop body

v=runif(n)
# vectorized version: fast
vv<-v*v;
m<-mean(vv); m
# for loop: slow
tracemem(vv)
vv[1]<-v[1]*v[1]
tracemem(vv)
for(i in 2:length(v)) { vv[i]<-v[i]*v[i] };
m<-mean(vv); m

gives this output

> vv<-v*v;
> m<-mean(vv); m
[1] 0.33385
> # for loop: slow
> tracemem(vv)
[1] "<0x000007fffef80010"
> vv[1]<-v[1]*v[1]
tracemem[0x000007fffef80010 -> 0x000007fffddc0010]: 
> tracemem(vv)
[1] "<0x000007fffddc0010"
> for(i in 2:length(v)) { vv[i]<-v[i]*v[i] };
> m<-mean(vv); m
[1] 0.33385 # (different because I generated the random numbers again)

After reading joran's answer and a nabble discussion thread, I became familiar with the idea that R may copy a vector, for example when an assignment changes its type, as below:

> x = 1:10
> tracemem(x)
[1] "<0x00000000118ba4e0"
> x[5] = 6
tracemem[0x00000000118ba4e0 -> 0x0000000010d03568]: 
> x = 1:10 # starts off as integer
> tracemem(x)
[1] "<0x00000000118ba538"
> x[5] = 6L # setting integer ok
> x[5] = 6 # setting floating point changes type
tracemem[0x00000000118ba538 -> 0x0000000010d03568]: 
> x[6] = 7 # it's now floating point, setting floating point again ok
> x[7] = "asdf" # setting string changes type once more, this tanks on a large array
tracemem[0x0000000010d03568 -> 0x0000000010d03610]:
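
The type changes in the session above can also be checked without tracemem by looking at typeof() after each assignment. This is a minimal sketch; the names t0 to t3 are made up for illustration.

```r
# Track the storage type as values of different types are assigned.
x <- 1:10
t0 <- typeof(x)      # starts off as "integer"
x[5] <- 6L           # integer into integer: type unchanged
t1 <- typeof(x)
x[5] <- 6            # double into integer: the whole vector is coerced
t2 <- typeof(x)
x[7] <- "asdf"       # string into double: the whole vector is coerced again
t3 <- typeof(x)
print(c(t0, t1, t2, t3))
```

Each coercion has to build a new vector of the wider type, which is why tracemem reports a copy at exactly those assignments.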

So I have a rough idea of what's going on. But why is vv copied in my first example (or what have I misinterpreted), when vv is already a vector of floating-point numbers?

Speculation: probably to create a copy that is local to the for loop, just as an argument to a function might be copied once when handed off. (for loops, like most other things in R, are functions.) — joran, Jun 03 '13 at 22:20
@joran really interesting! Do you have a link to an accessible introduction to the functional side of R programming, or have you just picked this up organically? (My background is compsci, so by accessible I mean a technical summary of a few pages as opposed to dozens of pages requiring prerequisites.) — TooTone, Jun 03 '13 at 22:24
My last sentence above may be technically inaccurate, but it is what I have inferred from [here](http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Looping). — joran, Jun 03 '13 at 22:31
All assignments make copies of the entire object. Assignment is not "in-memory" or by-reference. That's why 'data.table' functions were invented. This has been hashed out in the r-devel mailing list many times. — IRTFM, Jun 04 '13 at 03:29

Matthew Lundberg · Accepted Answer · 2013-06-03T22:43:09.937

A copy is made because R thinks that there may be another reference to the object:

x <- 1:10
.Internal(inspect(x))
## @5a27838 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,2,3,4,5,...
# NAM(1) means that there is one reference to the object.

tracemem(x)
## [1] "<0x05a27838>"
.Internal(inspect(x))
## @5a27838 13 INTSXP g0c4 [NAM(1),TR] (len=10, tl=0) 1,2,3,4,5,...
# Still one reference

mean(x)
## [1] 5.5
.Internal(inspect(x))
## @5a27838 13 INTSXP g0c4 [NAM(2),TR] (len=10, tl=0) 1,2,3,4,5,...
# NAM(2) means "more than one" reference.
# A copy of the "pointer" was taken to pass to "mean", which bumped the count.
# The count starts at (essentially) 1 and is set to 2 when the pointer is copied. It is never set back to 1, though.

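x[1] <- 0
## tracemem[0x05a27838 -> 0x05a278c8]: 
## tracemem[0x05a278c8 -> 0x05a0d6f0]: 
# Two copies are reported: presumably the first because NAM(2) forces a duplicate,
# and the second because assigning the double 0 coerces the integer vector to double.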
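
Applied to the loop in the question, the explanation above suggests that the mean(vv) call before the loop is what marks vv as possibly shared, so the first vv[i] <- ... duplicates it. The following is a minimal sketch, assuming the result vector is preallocated and not passed to any function before the loop. The names vv_loop and traceable are made up for illustration. From R 4.0.0 onwards the NAMED mechanism was replaced by reference counting, so what tracemem reports on a current installation may differ from the output shown in this thread.

```r
n <- 10^5                      # smaller than in the question, to keep it quick
v <- runif(n)

vv_loop <- numeric(n)          # fresh double vector, bound to one name only

# tracemem() needs an R build with memory profiling support
traceable <- capabilities("profmem")
if (traceable) tracemem(vv_loop)

for (i in seq_along(v)) {
  vv_loop[i] <- v[i] * v[i]    # double into double: no coercion needed
}

if (traceable) untracemem(vv_loop)
m <- mean(vv_loop)             # call mean() only after the loop has finished
print(m)
```

seq_along(v) is used instead of 1:length(v) so that the loop body is skipped for an empty vector.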

An assignment doesn't copy the data itself (that happens only when one of the names is later modified). Rather, it copies the pointer and marks the object as having more than one reference:

x <- 1
y <- x
.Internal(inspect(x))
## @5a61848 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
.Internal(inspect(y))
## @5a61848 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
y[1] <- 1
.Internal(inspect(y))
## @5a61948 14 REALSXP g0c1 [NAM(1)] (len=1, tl=0) 1
# Note, a new memory address, and NAM(1).
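
The same copy-on-modify behaviour can be observed through values alone, which avoids depending on the exact format of .Internal(inspect()) output. This is a minimal sketch; the names x_unchanged and y_changed are made up for illustration, and the tracemem call is guarded because it is only available when R was built with memory profiling.

```r
x <- c(1, 2, 3)
y <- x                 # no data copied yet: both names refer to the same vector

# Ask R to report if that vector is duplicated (skipped if unsupported)
if (capabilities("profmem")) tracemem(x)

y[1] <- 100            # modifying through y forces a private copy for y

x_unchanged <- identical(x, c(1, 2, 3))
y_changed   <- identical(y, c(100, 2, 3))
print(c(x_unchanged, y_changed))

if (capabilities("profmem")) untracemem(x)
```

After the modification, x still holds its original values while y holds the changed ones, so the two names can no longer be sharing storage.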

Note that you'll get different results if you're running within RStudio. — Matthew Lundberg, Jun 03 '13 at 22:47
Very comprehensive. I just reran my code and took out the `m<-mean(vv); m` on line 6 after `vv<-v*v`, and the `tracemem[addr1->addr2]` output disappeared, indicating that there is only a single copy. (From what I've read elsewhere, the additional copy in the original code would eventually be garbage collected.) — TooTone, Jun 04 '13 at 12:48
