I benchmarked a few ways of replacing a sentinel value (-99) with NA, column by column.
library(microbenchmark)
set.seed(11)
df <- data.frame(replicate(3, sample(c(1:5, -99), 6, replace = TRUE)))
names(df) <- letters[1:3]
fix_na <- function(x) {
  x[x == -99] <- NA
  x  # return the modified vector; the bare assignment alone would return NA
}
microbenchmark(
  for (i in seq_along(df)) df[, i] <- fix_na(df[, i]),
  for (i in seq_along(df)) df[[i]] <- fix_na(df[[i]]),
  df[] <- lapply(df, fix_na)
)
Unit: microseconds
                                                expr     min       lq     mean   median      uq     max neval
 for (i in seq_along(df)) df[, i] <- fix_na(df[, i]) 179.167 191.9060 206.1650 204.2335 211.630 364.497   100
 for (i in seq_along(df)) df[[i]] <- fix_na(df[[i]])  83.420  92.8715 104.5787  98.0080 109.309 204.645   100
                          df[] <- lapply(df, fix_na) 105.199 113.4175 128.0265 117.9385 126.979 305.734   100
Why is subsetting the data frame with [[ ]] about twice as fast as with [, ]?
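To narrow the question down, the read can be separated from the assignment. Here is a minimal sketch using base R only: it uses system.time instead of microbenchmark, and the repetition count n is an arbitrary choice for illustration. It only checks that both forms return the same vector and times the extraction alone; I make no claim about which number comes out lower on a given machine.

```r
set.seed(11)
df <- data.frame(replicate(3, sample(c(1:5, -99), 6, replace = TRUE)))
names(df) <- letters[1:3]

# Both forms return the same plain vector for a single column
same_read <- identical(df[, 1], df[[1]])

# Rough timing of the read alone, repeated n times (n is arbitrary)
n <- 1e4
t_bracket <- system.time(for (k in seq_len(n)) df[, 1])[["elapsed"]]
t_double  <- system.time(for (k in seq_len(n)) df[[1]])[["elapsed"]]
```

Comparing t_bracket with t_double shows whether the gap already appears when reading a column, before any replacement happens.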
EDIT
I included the two recommended calls from docendo discimus and increased the amount of data.
set.seed(11)
df1 <- data.frame(replicate(2000, sample(c(1:5, -99), 500, replace = TRUE)))
df2 <- df1
df3 <- df1
df4 <- df1
df5 <- df1
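Before timing, it seems worth checking that the five expressions in the expr column of the timings really produce the same result. A minimal sketch, assuming a fix_na that returns x; the reduced size (50 rows, 20 columns) and the names small and d1 to d5 are my own choices for illustration, not part of the benchmark.

```r
# fix_na must return the modified vector
fix_na <- function(x) {
  x[x == -99] <- NA
  x
}

set.seed(11)
# Smaller frame than in the benchmark, for illustration only
small <- data.frame(replicate(20, sample(c(1:5, -99), 50, replace = TRUE)))
d1 <- d2 <- d3 <- d4 <- d5 <- small

# The five approaches, each applied to its own copy
for (i in seq_along(d1)) d1[, i] <- fix_na(d1[, i])
for (i in seq_along(d2)) d2[[i]] <- fix_na(d2[[i]])
d3[] <- lapply(d3, fix_na)
d4[d4 == -99] <- NA
is.na(d5) <- d5 == -99

# TRUE if every copy matches the first one
same <- all(sapply(list(d2, d3, d4, d5),
                   function(d) isTRUE(all.equal(d, d1))))
```

Each approach gets its own copy so that no expression operates on data that an earlier one has already cleaned.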
The results change, but my question remains: [[ ]] is still faster than [, ].
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq      max
 for (i in seq_along(df1)) df1[, i] <- fix_na(df1[, i]) 301.06608 356.48011 377.31592 372.05625 392.73450 472.3330
 for (i in seq_along(df2)) df2[[i]] <- fix_na(df2[[i]]) 238.72005 287.55364 301.35651 298.05950 314.04369 386.4288
                           df3[] <- lapply(df3, fix_na) 170.53264 189.83858 198.32358 193.43300 202.43855 284.1164
                                  df4[df4 == -99] <- NA  75.05571  77.64787  85.59757  80.72697  85.16831 363.2223
                               is.na(df5) <- df5 == -99  74.44877  77.81799  84.22055  80.06496  83.01401 347.5798