
I benchmarked a few solutions for replacing missing values per column.

set.seed(11)
df <- data.frame(replicate(3, sample(c(1:5, -99), 6, rep = TRUE)))
names(df) <- letters[1:3]

fix_na <- function(x) {
  x[x == -99] <- NA
  x
}

library(microbenchmark)

microbenchmark(
  for(i in seq_along(df)) df[, i] <- fix_na(df[, i]),
  for(i in seq_along(df)) df[[i]] <- fix_na(df[[i]]),
  df[] <- lapply(df, fix_na)
)

Unit: microseconds
                                                expr     min       lq     mean   median      uq     max neval
 for (i in seq_along(df)) df[, i] <- fix_na(df[, i]) 179.167 191.9060 206.1650 204.2335 211.630 364.497   100
 for (i in seq_along(df)) df[[i]] <- fix_na(df[[i]])  83.420  92.8715 104.5787  98.0080 109.309 204.645   100
                          df[] <- lapply(df, fix_na) 105.199 113.4175 128.0265 117.9385 126.979 305.734   100

Why is subsetting the data frame with the `[[` operator about 2x faster than with the `[, ]` operator?

EDIT

I added the two calls recommended by docendo discimus and increased the amount of data.

set.seed(11)
df1 <- data.frame(replicate(2000, sample(c(1:5, -99), 500, rep = TRUE)))
df2 <- df1
df3 <- df1
df4 <- df1
df5 <- df1

The results change, yes, but my question still stands: `[[` performs faster than `[, ]`.

Unit: milliseconds
                                                   expr       min        lq      mean    median        uq      max
 for (i in seq_along(df1)) df1[, i] <- fix_na(df1[, i]) 301.06608 356.48011 377.31592 372.05625 392.73450 472.3330
 for (i in seq_along(df2)) df2[[i]] <- fix_na(df2[[i]]) 238.72005 287.55364 301.35651 298.05950 314.04369 386.4288
                            df3[] <- lapply(df3, fix_na) 170.53264 189.83858 198.32358 193.43300 202.43855 284.1164
                                  df4[df4 == -99] <- NA  75.05571  77.64787  85.59757  80.72697  85.16831  363.2223
                               is.na(df5) <- df5 == -99  74.44877  77.81799  84.22055  80.06496  83.01401  347.5798
Tobi_R
  • If you are benchmarking on a small dataset, it doesn't give the correct output – akrun Jul 08 '16 at 07:53
  • Possible duplicate of [R: Why is the \[\[ \]\] approach for subsetting a list faster than using $?](http://stackoverflow.com/questions/16630087/r-why-is-the-approach-for-subsetting-a-list-faster-than-using) – ArunK Jul 08 '16 at 08:04
  • You can add two more approaches to your benchmark: `df[df == -99] <- NA` and `is.na(df) <- df == -99` – talat Jul 08 '16 at 08:08
  • @Arun Thanks for the hint. But as far as I know, the `$` operator is short for `[["x", exact = FALSE]]`. So it does not really help in comparison with the `[, ]` operator, does it? – Tobi_R Jul 08 '16 at 08:37
  • @Tobi_R. As I understand it, it has all got to do with partial matching (personally I haven't had a chance to explore that deeply; this is hidden in the comments discussion). Also, `$` and `[[` are implemented using C functions. Hadley has written a nice description of the different methods of subsetting: http://adv-r.had.co.nz/Subsetting.html. Albeit the link doesn't have any benchmark results. Another good description of the performance of R is http://adv-r.had.co.nz/Performance.html – ArunK Jul 08 '16 at 09:19
  • Thanks for your suggestions @Arun. The second link gives an answer to a very similar question. – Tobi_R Jul 08 '16 at 09:52
  • In your question you are _not_ comparing `[` VS `[[`. You _could_ be comparing `[.data.frame` VS `[[.data.frame` but you are actually, also, comparing `[<-.data.frame` VS `[[<-.data.frame`. You could scan through those functions and find what probably -if anything- adds computational time depending on the number of arguments etc. – alexis_laz Jul 08 '16 at 10:13
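Following up on that comment, one way to separate extraction cost from assignment cost is to benchmark the extraction step on its own; a minimal sketch (absolute timings will vary by machine):

```r
library(microbenchmark)

set.seed(11)
df <- data.frame(replicate(2000, sample(c(1:5, -99), 500, rep = TRUE)))

# Extraction only: `[.data.frame` vs `[[.data.frame`, no assignment involved
microbenchmark(
  bracket        = df[, 1],
  double_bracket = df[[1]]
)
```

Both expressions return the same vector (`drop = TRUE` is the default for a single column), so any timing gap here comes purely from the dispatch and argument handling of the two methods.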

2 Answers


A faster approach would be using `set` from `data.table`:

library(data.table)
setDT(df)
for(j in seq_along(df)){
  set(df, i = which(df[[j]] == -99), j = j, value = NA)
}

Regarding the OP's question about the benchmarking with `[` and `[[`: `[[` extracts the column without the overhead of the `[.data.frame` method. But I would benchmark on a bigger dataset to find any difference. Also, since we assign NA to the same data, repeating the operation doesn't change anything after the first run.
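One rough way to see the extra machinery behind `[` is to compare the size of the two S3 methods (an illustration only; the exact line counts vary by R version):

```r
# `[.data.frame` handles row subsetting, multiple columns, `drop`, etc.,
# while `[[.data.frame` is a much shorter function
length(deparse(`[.data.frame`))
length(deparse(`[[.data.frame`))
```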

Benchmarks

set.seed(11)
df1 <- data.frame(replicate(2000, sample(c(1:5, -99), 500, rep = TRUE)))
df2 <- copy(df1)
df3 <- copy(df1)
df4 <- copy(df1)
df5 <- copy(df1)
df6  <- copy(df1)

 f1 <- function() for (i in seq_along(df1)) df1[, i] <- fix_na(df1[, i])
 f2 <- function() for (i in seq_along(df2)) df2[[i]] <- fix_na(df2[[i]])
 f3 <- function() df3[] <- lapply(df3, fix_na)
 f4 <- function() df4[df4 == -99] <- NA
 f5 <- function() is.na(df5) <- df5 == -99

 f6 <- function() {
   setDT(df6)
   for(j in seq_along(df6)){
     set(df6, i = which(df6[[j]] == -99), j = j, value = NA)
   }
 }

 t(sapply(paste0("f", 1:6), function(f) system.time(get(f)())))[,1:3]
 #   user.self sys.self elapsed
 #f1      0.29        0    0.30
 #f2      0.22        0    0.22
 #f3      0.11        0    0.11
 #f4      0.31        0    0.31
 #f5      0.31        0    0.32
 #f6      0.00        0    0.00

Here, I am using system.time as the functions in the OP's post already replace the values with NA in the first run, so there is no point in running them again and again.
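If one does want repeated timings, a way around the already-replaced problem is to restore the data inside each expression; the copy overhead is then the same for every method, so the comparison stays fair (a sketch, not part of the original benchmark):

```r
library(microbenchmark)

set.seed(11)
df0 <- data.frame(replicate(200, sample(c(1:5, -99), 500, rep = TRUE)))

fix_na <- function(x) {
  x[x == -99] <- NA
  x
}

# Reset from df0 inside each expression so every iteration starts clean
microbenchmark(
  lapply_fix = { d <- df0; d[] <- lapply(d, fix_na) },
  matrix_sub = { d <- df0; d[d == -99] <- NA },
  is_na      = { d <- df0; is.na(d) <- d == -99 },
  times = 20
)
```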

akrun
  • the OP asks why meth1 is faster than meth2 with a sample too small to conclude anyway. Your answer is "meth3 is faster" with a benchmark of meth3 on a decent sample. I don't find this answering the OP's question. This is my opinion, you may disagree – Cath Jul 08 '16 at 08:17
  • **Moderator note**: Stop arguing over voting; if you want to discuss voting behaviour, either take it to Meta or chat. – Martijn Pieters Jul 08 '16 at 08:28
  • Thank you for your faster solution akrun. I will keep it in mind. Why do you think that `[` does operate with the overhead of a data.frame? `drop = TRUE`, thus both results `df[, i]` and `df[[i]]` are vectors – Tobi_R Jul 08 '16 at 09:49
  • @Tobi_R The `[[` is used for subsetting a single column or a single list element. By using `[`, it can be used for subsetting multiple columns, and with `,` the rows also come into the picture. Though we are leaving the lhs of `,` blank, I am guessing that it will still go check the row part. – akrun Jul 08 '16 at 10:06
  • `[[`, also, dispatches on its "data.frame" method and it, also, accepts an argument for rows: `mtcars[2, 6]`, `mtcars[[2, 6]]` – alexis_laz Jul 08 '16 at 10:16

I got an answer for a very similar problem on the site suggested by Arun: adv-r.had.co.nz/Performance.html

In the section _Extracting a single value from a data frame_ it says:

The following microbenchmark shows seven ways to access a single value (the number in the bottom-right corner) from the built-in mtcars dataset. The variation in performance is startling: the slowest method takes 30x longer than the fastest. There’s no reason that there has to be such a huge difference in performance. It’s simply that no one has had the time to fix it.

Among the different selection methods, the two operators `[[` and `[` are also compared, with the same result as I observed: `[[` outperforms `[`.
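That comparison can be reproduced in miniature; a sketch of a few of the access methods for the bottom-right value of `mtcars` (row 32, column 11):

```r
library(microbenchmark)

# All four expressions return the same value, at very different speeds
microbenchmark(
  bracket = mtcars[32, 11],
  double  = mtcars[[11]][32],
  dollar  = mtcars$carb[32],
  subset2 = .subset2(mtcars, 11)[32]
)
```

`.subset2` skips S3 dispatch entirely, which is why it sits at the fast end of Hadley's ranking.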

Tobi_R
  • Note that in your question you are not only _extracting_ values but, you're, also, _assigning_ values to "data.frame" – alexis_laz Jul 08 '16 at 10:21
  • Well, you are absolutely right. I did not see the _assigning_ part as a part of my stated problem. Thanks for pointing that out. – Tobi_R Jul 08 '16 at 11:21