Problem
I have this code that I need to make faster :)
```r
if (length(vec) == 0) { # first case
  count = sum(apply(df, 1, function(x) {
    all(x == 0, na.rm = TRUE)
  }))
} else if (length(vec) == 1) { # second case
  count = sum(df[, vec], na.rm = TRUE)
} else { # third case
  count = sum(apply(df[, vec], 1, function(x) {
    all(x == 1)
  }), na.rm = TRUE)
}
```
`df` is a data.frame with only 1, 0 or NA values. `vec` is a subset of `colnames(df)`.
- First case: count the rows that, after the NA's are removed, have only 0's (a row of nothing, e.g. a row that had only NA's, counts too).
- Second case: count the 1's in the single chosen column, after removing the NA's.
- Third case: from the filtered data.frame, count the rows whose values are all equal to 1.
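As a concrete illustration of the three branches, here is a tiny made-up data frame (not from the original post) run through each case:

```r
# Hypothetical 3-row example showing all three cases
df <- data.frame(a = c(0, NA, 1),
                 b = c(0, NA, 1),
                 c = c(NA, NA, 1))

# First case (empty vec): rows that are all 0 once NA's are dropped.
# Row 1 is (0, 0, NA) and row 2 is all NA, so both count.
case1 <- sum(apply(df, 1, function(x) all(x == 0, na.rm = TRUE)))  # 2

# Second case (one column): count the 1's in that column, ignoring NA's
case2 <- sum(df[, "a"], na.rm = TRUE)  # 1

# Third case (several columns): rows where every value is 1; an NA
# disqualifies, since all(x == 1) is NA there and sum() drops it
vec <- c("a", "b")
case3 <- sum(apply(df[, vec], 1, function(x) all(x == 1)), na.rm = TRUE)  # 1
```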
Question
Is there any way to make this code run faster, using `dplyr` or something else, given that it manipulates the data frame by row? For example, when I exchanged the easiest one (2nd case), `count = sum(df[, vec], na.rm = T)`, with the `dplyr` version `sum(df %>% select(vec), na.rm = T)` and ran a benchmark, it was considerably worse (though I don't think the 2nd case can get considerably faster with any method). Any tips or tricks for the 1st and 3rd cases are welcome!
Benchmarking
A huge enough data set to play with (a matrix actually, but the code below works the same on a data.frame):

```r
df = matrix(data = sample(c(0, 1, NA), size = 100000, replace = TRUE),
            nrow = 10000, ncol = 10)
```
- The first case:
```r
rbenchmark::benchmark(
  "prev" = {
    sum(apply(df, 1, function(x) { all(x == 0, na.rm = T) }))
  },
  "new-long" = {
    sum((rowSums(df == 0, na.rm = TRUE) + rowSums(is.na(df)) == ncol(df)))
  },
  "new-short" = {
    sum(!rowSums(df != 0, na.rm = TRUE))
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)
```
Results:
```
       test replications elapsed relative user.self sys.self
2  new-long         1000   1.267    1.412     1.267        0
3 new-short         1000   0.897    1.000     0.897        0
1      prev         1000  11.857   13.219    11.859        0
```
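For what it's worth, `new-short` is correct because a row is "all 0 or NA" exactly when it has zero entries that are non-zero and non-NA, and `!0` is `TRUE` in R. A small sketch on an illustrative matrix (not the benchmark data):

```r
# Row 1 is all 0/NA, row 2 is all NA, row 3 has a 1
m <- matrix(c(0,  0, NA,
              NA, NA, NA,
              0,  1,  0), nrow = 3, byrow = TRUE)

nonzero <- rowSums(m != 0, na.rm = TRUE)  # per-row count of non-zero entries
# nonzero is c(0, 0, 1); !nonzero is TRUE exactly where the count is 0
count <- sum(!nonzero)  # 2
```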
- The third case (`vec = 1:5` for example):

```r
rbenchmark::benchmark(
  "prev" = {
    sum(apply(df[, vec], 1, function(x) { all(x == 1) }), na.rm = T)
  },
  "new" = {
    sum(!rowSums(replace(df[, vec], is.na(df[, vec]), -999) != 1))
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)
```
Results:
```
  test replications elapsed relative user.self sys.self
2  new         1000   0.179    1.000     0.175    0.004
1 prev         1000   2.219   12.397     2.219    0.000
```
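The `replace()` trick works because mapping NA to a sentinel that can never equal 1 makes `rowSums(... != 1)` zero exactly on the all-1 rows. A small sketch on a hypothetical matrix (the -999 sentinel is safe only because the data contains nothing but 0, 1 and NA):

```r
m <- matrix(c(1, 1,
              1, NA,
              1, 0), nrow = 3, byrow = TRUE)

m2 <- replace(m, is.na(m), -999)  # NA -> -999, so rows with NA are disqualified
bad <- rowSums(m2 != 1)           # per-row count of entries that are not 1
count <- sum(!bad)                # only row 1 has zero "bad" entries -> 1
```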
Overall, a nice speedup using `rowSums`! Use it instead of `apply`!