
Problem

I have this function that I need to make go faster :)

if (length(vec) == 0) { # first case
  count = sum(apply(df, 1, function(x) {
    all(x == 0, na.rm = T)
  }))
} else if (length(vec) == 1) { # second case
  count = sum(df[, vec], na.rm = T)
} else {
  count = sum(apply(df[, vec], 1, function(x) { # third case
    all(x == 1)
  }), na.rm = T)
}

df is a data.frame containing only 1, 0 or NA values. vec is a subset of colnames(df).

  • First case: count the rows that, after the NA's are removed, contain only 0's (or nothing at all, i.e. the row had only NA's; count that row too)
  • Second case: count the 1's in the single chosen column after removing the NA's
  • Third case: from the data.frame restricted to the columns in vec, count the rows whose values are all equal to 1 (a small worked example follows this list)
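To make the cases concrete, here is a tiny worked example (my sketch; it reuses the small data.frame from the comment thread below, with the expected counts noted inline):

df = data.frame(col1 = c(0, NA, NA, NA, 1, 0),
                col2 = c(0, 0, NA, 1, 1, 1))
vec = c("col1", "col2")

# First case: rows 1 (0,0), 2 (NA,0) and 3 (NA,NA) qualify -> 3
sum(apply(df, 1, function(x) all(x == 0, na.rm = TRUE)))

# Second case with a single column, e.g. vec = "col2": three 1's -> 3
sum(df[, "col2"], na.rm = TRUE)

# Third case: only row 5 (1,1) has all values equal to 1 -> 1
sum(apply(df[, vec], 1, function(x) all(x == 1)), na.rm = TRUE)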

Question

Is there any way to make this code run faster, using dplyr or something else, given that it manipulates the data frame by row? For example, when I replaced the easiest one (2nd case), count = sum(df[, vec], na.rm = T), with the dplyr version sum(df %>% select(vec), na.rm = T) and ran a benchmark, it was considerably worse (but OK, I don't think the 2nd case can get considerably faster with any method).
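For reference, here is a sketch of that comparison (my reconstruction, not the exact benchmark from this post; all_of() is the current dplyr idiom for selecting columns by a character vector):

library(dplyr)
df = as.data.frame(matrix(sample(c(0, 1, NA), 100000, replace = TRUE),
                          nrow = 10000, ncol = 10))
vec = names(df)[1]
rbenchmark::benchmark(
  "base" = { sum(df[, vec], na.rm = TRUE) },
  "dplyr" = { sum(df %>% select(all_of(vec)), na.rm = TRUE) },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative"))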

Any tips or tricks for the 1st and 3rd cases are welcome!

Benchmarking

A big enough dataset to play with (a matrix here; all of the code below accepts it as well): df = matrix(data = sample(c(0, 1, NA), size = 100000, replace = TRUE), nrow = 10000, ncol = 10).

  • The first case:
rbenchmark::benchmark("prev" = {sum(apply(df, 1, function(x) {all(x == 0, na.rm = T)}))}, "new-long" = {sum((rowSums(df == 0, na.rm = TRUE) + rowSums(is.na(df)) == ncol(df)))}, "new-short" = {sum(!rowSums(df != 0, na.rm = TRUE))}, replications = 1000, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))

Results:

       test replications elapsed relative user.self sys.self
2  new-long         1000   1.267    1.412     1.267        0
3 new-short         1000   0.897    1.000     0.897        0
1      prev         1000  11.857   13.219    11.859        0
  • The third case (vec = 1:5 for example):
rbenchmark::benchmark("prev" = {sum(apply(df[, vec], 1, function(x) { all(x == 1) }), na.rm = T)}, "new" = {sum(!rowSums(replace(df[, vec], is.na(df[, vec]), -999) != 1))}, replications = 1000, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))

Results:

  test replications elapsed relative user.self sys.self
2  new         1000   0.179    1.000     0.175    0.004
1 prev         1000   2.219   12.397     2.219    0.000

Overall, a nice speedup from rowSums! Use it instead of apply whenever you can!
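For intuition on why rowSums wins (my note, not from the original thread): apply() pays R-level function-call overhead once per row, while rowSums() does a single vectorized pass in C, so the gap widens with the number of rows. A minimal timing sketch on a bigger matrix:

df2 = matrix(sample(c(0, 1, NA), size = 1e6, replace = TRUE), nrow = 1e5)
system.time(sum(apply(df2, 1, function(x) all(x == 0, na.rm = TRUE))))
system.time(sum(!rowSums(df2 != 0, na.rm = TRUE)))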

John
  • You may need `rowSums` which would be vectorized. Without a reproducible example, it is difficult for others to come up with a solution. For the first case `sum(!rowSums(df != 0, na.rm = TRUE))` The second case seems fine and the third case would be `sum(!rowSums(df[, vec] !=1, na.rm = TRUE))` – akrun Oct 14 '19 at 15:44
  • @akrun thanks! The third case is wrong though (outputs a lot more rows) - I want to keep only the rows that have all values equal to 1 and add them to the row count... – John Oct 14 '19 at 16:15
  • Not able to reproduce with this example `df <- data.frame(col1 = c(1, 2, 3, 1, 1), col2 = c(1, 1, 2, 1, 2));sum(!rowSums(df!=1, na.rm = TRUE))# [1] 2` – akrun Oct 14 '19 at 16:16
  • Maybe because the `df` has only 0, 1 and NA (I think)? – John Oct 14 '19 at 16:20
  • As I mentioned above, the post didn't show any reproducible example, so it is unclear – akrun Oct 14 '19 at 16:23
  • I know, I know :) Trying to make one! – John Oct 14 '19 at 16:25
  • Use: `df = data.frame(col1 = c(0, NA, NA, NA, 1, 0), col2 = c(0,0, NA, 1, 1, 1))`. Then the 3rd case would be like this: `sum(apply(df, 1, function(x) { all(x == 1) }), na.rm = T)` (result: 1 row only) and yours: `sum(!rowSums(df != 1, na.rm = TRUE))` (results: 3 rows) – John Oct 14 '19 at 16:29
  • Sorry, try this `sum(!rowSums(replace(df, is.na(df), -999) != 1))` – akrun Oct 14 '19 at 16:30

1 Answer


Here is an option to optimize the code with rowSums for the first and third cases. Since there are edge cases when a row's values are NA, one option is to replace those values with a value not present in the dataset, build a logical matrix from a comparison, negate the rowSums to get a logical vector, and take the sum of its TRUE values.

For the first case:

sum((rowSums(df == 0, na.rm = TRUE) + rowSums(is.na(df)) == ncol(df)))

Or

sum(!rowSums(df != 0, na.rm = TRUE))

For the third case:

sum(!rowSums(replace(df[, vec], is.na(df[, vec]), -999) != 1))
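As a quick sanity check (my addition, using the small data.frame John posted in the comments above), the replace() version agrees with the original apply() version:

df = data.frame(col1 = c(0, NA, NA, NA, 1, 0),
                col2 = c(0, 0, NA, 1, 1, 1))
vec = names(df)
sum(apply(df[, vec], 1, function(x) all(x == 1)), na.rm = TRUE) # 1
sum(!rowSums(replace(df[, vec], is.na(df[, vec]), -999) != 1))  # 1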
akrun
  • The `na.rm` is needed in the first? (since you replace all NA's with -999) – John Oct 14 '19 at 16:36
  • @John No, it is not needed, I forgot to remove it. thanks – akrun Oct 14 '19 at 16:37
  • @John Please let me know if it improved the efficiency – akrun Oct 14 '19 at 16:37
  • Ok, testing it a little bit myself and benchmarking it and accepting it afterwards! Thanks! – John Oct 14 '19 at 16:38
  • First is not correct anymore! But we can keep the `sum(!rowSums(df != 0, na.rm = TRUE))` I guess... – John Oct 14 '19 at 16:43
  • @John Based on your example, I get `sum(!rowSums(replace(df, is.na(df), -999) != 0))#[1]` which is the number of all 0's in a row. Did you meant the logic to be without the `NA` or with the NA? – akrun Oct 14 '19 at 16:45
  • So, what the first case does is this: take a row, remove the NA's => if only zeros (0's) remain, count that row. Even if no elements remain (the whole row was NA's), count that row. – John Oct 14 '19 at 16:52
  • @John that means a row with all `NA` is also counted, right? – akrun Oct 14 '19 at 16:53
  • Yes that is true! – John Oct 14 '19 at 16:53
  • The solution for the first case with only one use of `rowSums` is a little bit faster than the one with two uses. – John Oct 14 '19 at 16:59
  • @John But, I guess it is still faster than the `apply`, right? – akrun Oct 14 '19 at 17:00
  • Updated answer with benchmark and overall results. Thanks! – John Oct 15 '19 at 10:00