Why does selecting column(s) from a data.table result in a copy?

Question

It appears that selecting column(s) from a data.table with [.data.table results in a copy of the underlying vector(s). I am talking about very simple column selection by name: there are no expressions to compute in j and no rows to subset in i. Even more strangely, column subsetting in a data.frame appears not to make any copies. I am using data.table version 1.10.4. A simple example with details and benchmarks is provided below. My questions are:

Am I doing something wrong?
Is this a bug or is this the intended behavior?
If this is intended, what is the best approach to subset a data.table by columns and avoid extra copy?

The intended use case involves large datasets, so avoiding extra copies is a must (especially since base R seems to already support this).

library(data.table)
library(microbenchmark)  # needed for the benchmark further down
set.seed(12345)
cpp_dt <- data.table(a = runif(1e6), b = rnorm(1e6), c = runif(1e6))
cols=c("a","c")

## naive / data.frame style of column selection
## leads to a copy of the column vectors in cols
subset_cols_1=function(dt,cols){
  return(dt[,cols,with=FALSE])
}

## alternative syntax, still results in a copy
subset_cols_2=function(dt,cols){
  return(dt[,..cols])
}

## work-around that uses data.frame column selection,
## appears to avoid the copy
subset_cols_3=function(dt,cols){
  setDF(dt)
  subset=dt[,cols]
  setDT(subset)
  setDT(dt)
  return(subset)
}

## another approach that makes a "shallow" copy of the data.table
## then NULLs the not needed columns by reference
## appears to also avoid the copy
subset_cols_4=function(dt,cols){
  subset=dt[TRUE]
  other_cols=setdiff(names(subset),cols)
  set(subset,j=other_cols,value=NULL)
  return(subset)
}

subset_1=subset_cols_1(cpp_dt,cols)
subset_2=subset_cols_2(cpp_dt,cols)
subset_3=subset_cols_3(cpp_dt,cols)
subset_4=subset_cols_4(cpp_dt,cols)

Now let's look at the memory allocation and compare it to the original data.

.Internal(inspect(cpp_dt)) # original data, keep an eye on the 1st and 3rd vectors
# @7fe8ba278800 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=3, tl=1027)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f1a3000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) -0.947317,-0.636669,0.167872,-0.206986,0.411445,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Using the [.data.table method to subset the columns:

.Internal(inspect(subset_1)) # looks like data.table is making a copy
# @7fe8b9f3b800 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
#   @114cb0000 14 REALSXP g0c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @1121ca000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Another syntax that still uses [.data.table and still makes a copy:

.Internal(inspect(subset_2)) # same, still copy
# @7fe8b6402600 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
#   @115452000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @1100e7000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Using a sequence of setDF, followed by [.data.frame and setDT. Look, the vectors a and c are no longer copied! It appears that the base R method is more efficient and has a smaller memory footprint?

.Internal(inspect(subset_3)) # "[.data.frame" is not making a copy!!
# @7fe8b633f400 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1026)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Another approach is to make a shallow copy of the data.table, then NULL all the extra columns by reference in the new data.table. Again, no copies are made.

.Internal(inspect(subset_4)) # 4th approach seems to also avoid the copy
# @7fe8b924d800 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1027)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
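A less verbose check than reading the inspect() dumps can be sketched as follows, assuming that data.table's exported address() helper reports the memory address of the column vector passed to it. The small table and the names dt, deep, shallow, shares_deep and shares_shallow are made up for illustration and are not part of the original example.

```r
library(data.table)

# Illustrative data; much smaller than the table in the question
dt <- data.table(a = runif(10), b = rnorm(10), c = runif(10))
cols <- c("a", "c")

# Column selection through [.data.table
deep <- dt[, ..cols]

# Shallow copy, then drop the unwanted columns by reference
shallow <- dt[TRUE]
set(shallow, j = setdiff(names(shallow), cols), value = NULL)

# Compare the address of column a in each result with the original
shares_deep <- address(deep$a) == address(dt$a)
shares_shallow <- address(shallow$a) == address(dt$a)
print(c(deep = shares_deep, shallow = shares_shallow))
```

If the addresses match, the result points at the same vector as the original; if they differ, the column was copied.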

Now let's look at the benchmarks of these four approaches. It looks like "[.data.frame" (subset_cols_3) is the clear winner.

microbenchmark({subset_cols_1(cpp_dt,cols)},
               {subset_cols_2(cpp_dt,cols)},
               {subset_cols_3(cpp_dt,cols)},
               {subset_cols_4(cpp_dt,cols)},
               times=100)

# Unit: microseconds
#                                 expr      min        lq      mean   median        uq       max neval
#  {     subset_cols_1(cpp_dt, cols) } 4772.092 5128.7395 8956.7398 7149.447 10189.397 53117.358   100
#  {     subset_cols_2(cpp_dt, cols) } 4705.383 5107.1690 8977.1816 6680.666  9206.164 53523.191   100
#  {     subset_cols_3(cpp_dt, cols) }  148.659  177.9595  285.4926  250.620   283.414  4422.968   100
#  {     subset_cols_4(cpp_dt, cols) }  193.912  241.9010  531.8308  336.467   384.844 20061.864   100

Maybe just wait for updates here: https://stackoverflow.com/a/26481429/ The `shallow` function is not exported yet but would probably help with this. — Frank, Aug 25 '17 at 03:00

score 6 · Answer 1 · answered Aug 26 '17 at 02:54

It's been a while since I thought about this, but here goes.

Good question. But why do you need to subset a data.table like that? We really need to see what you are doing next: the bigger picture. For that bigger picture, data.table probably has a different way of doing it than the base R idiom.

Roughly illustrating with what is probably a bad example:

DT[region=="EU", lapply(.SD, sum), .SDcols=10:20]

rather than the base R idiom of taking a subset and then doing something next (here, apply) on the result outside :

apply(DT[DT$region=="EU", 10:20], 2, sum)

In general, we want to encourage doing as much as possible inside one [...] so that data.table sees the i, j and by together in one operation and can optimize the combination. When you subset columns and then do the next thing outside afterwards, it requires more software complexity to optimize. In most cases, most of the computational cost is inside the first [...], which reduces the data to a relatively insignificant size.
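As a minimal sketch of that idea, the column names can be passed as a character vector through .SDcols so that the row filter, the column choice and the computation all happen in one call. The table dt, its columns and the name res are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical data for illustration
dt <- data.table(region = c("EU", "EU", "US"),
                 x = c(1, 2, 3),
                 y = c(10, 20, 30))
cols <- c("x", "y")

# Filter rows in i, restrict .SD to cols, and aggregate in j, all in one call
res <- dt[region == "EU", lapply(.SD, sum), .SDcols = cols]
print(res)
```

No intermediate column subset is assigned here; only the small aggregated result is returned.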

With that said, in addition to Frank's comment about shallow, we're also waiting to see how the ALTREP project pans out. It improves reference counting in base R and may enable := to know reliably whether a column it is operating on needs to be copied first (copy-on-write) or not. Currently, := always updates by reference, so it would update both data.tables if selecting some whole columns did not take a deep copy (it is deliberate that it does copy, for that reason). If := is not used inside [...], then [...] always returns a new result which is safe to use := on, which is quite a straightforward rule currently. That holds even if all you're doing is selecting a few whole columns for some reason.
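A small sketch of the rule described above: because selecting whole columns returns a new result, using := on that result should leave the original table alone. The table dt and the names sub and original_a are hypothetical.

```r
library(data.table)

# Hypothetical data for illustration
dt <- data.table(a = 1:3, b = 4:6, c = 7:9)
cols <- c("a", "c")
original_a <- copy(dt$a)   # keep an independent copy for comparison

# Selecting whole columns returns a new result (a deep copy of those columns)
sub <- dt[, ..cols]

# Update the subset by reference
sub[, a := 0L]

# Compare the column in the original table with its saved copy
print(identical(dt$a, original_a))
```

If the selection shared its vectors with dt instead, an update by reference on sub could show up in dt as well, which is the situation the deliberate copy avoids.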

We really need to see the bigger picture please: what you're doing afterwards on the subset of columns. Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count for this case.
