2

I have written in R three different methods to make the variable S1_Total_Time that is the sum of a number of columns using dplyr mutate. When I run them, I get different outputs for the variable S1_Total_Time. In my dataframe each line is a person, so I use rowwise to make the sums go row by row.

In Method A, I get a sum of all the column values rowwise, ignoring NAs. Every row ends up with a value, but there is an issue. One row was a person with all NAs, and their summed TotalTime is 0, which is not accurate. Their S1_TotalTime should be NA, not 0. To solve that, I tried Method B.

In Method B, I also get a sum of all the column values rowwise. Here, I also ask it to make a determination first using ifelse as I want it to only count the timing value if both of these things are true:

  1. The companion variable to the timing variable for errors is not NA (this helps me exclude trials where the 'next' button was clicked, giving it a time, but no actual data)
  2. The timing variable I'm interested in and want to return is also not NA If either of these are NA, then I ask it to return NA for that variable that will be included in the sum. At the end of method B, I ask it to exclude NAs, so that it should only sum the products of the ifelses that were not NA. This does not work, it returns S1_TotalTime as NA for any line that had any NAs.

If I change method B to return 0 or another number if either 1 or 2 is false instead of NA (Method C), it works the almost the same as method A with some mysterious deviations.

  1. 37/41 rows are off by exactly +1 from method A.
  2. 4/41 rows are off by more than 1, where their method A is from -14 to -127 lower than their method B.

My Questions:

  1. Why do Methods A and C produce different results?

  2. Am I missing something about trying to make sure both variables have data before summing, and if they don't, excluding them but still getting a sum from the variables that were not made NA by my ifelse? Method A is only a problem because of the few lines where ALL timing variables are NA, and I can't have the sum be 0 for those.

    \> sessionInfo()

    R version 4.2.0 (2022-04-22)

    Platform: x86_64-apple-darwin17.0 (64-bit)

    Running under: macOS Monterey 12.4

    Matrix products: default

    LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

    locale:

    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8#

    attached base packages:

    [1] stats graphics grDevices utils datasets methods base

    other attached packages:

    [1] nlme_3.1-157 rstatix_0.7.2 ggpubr_0.6.0 reshape2_1.4.4 data.table_1.14.2 lubridate_1.9.2

    [7] forcats_1.0.0 stringr_1.5.0 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.1.8

    [13] tidyverse_2.0.0 Hmisc_4.7-0 Formula_1.2-4 survival_3.3-1 lattice_0.20-45 dplyr_1.1.0

    [19] ggplot2_3.4.1 haven_2.5.1

    Sample Data:

Data_for_analysis \<- structure(list(S1_1.1_timing_Page_Submit = c(38, 12, NA, 52),
S1_1.2a_timing_Page_Submit = c(19, 14, NA, 68), S1_1.2b_timing_Page_Submit = c(48,
65, NA, 190), S1_1.3_timing_Page_Submit = c(66, 9, NA, 20
), S1_2.1_timing_Page_Submit = c(307, 153, NA, 90.38), S1_2.2_timing_Page_Submit = c(NA,
28, NA, 3.752), S1_2.3_timing_Page_Submit = c(NA, NA, NA,
58\.996), S1_2.4a_timing_Page_Submit = c(NA, NA, NA, 1.203
), S1_2.4b_timing_Page_Submit = c(NA, NA, NA, 61.671), S1_2.5_timing_Page_Submit = c(NA,
NA, NA, 0.249), S1_2.6_timing_Page_Submit = c(NA, NA, NA,
0\.201), S1_3.1_timing_Page_Submit = c(NA, NA, NA, 0.244),
S1_3.2_timing_Page_Submit = c(NA, NA, NA, 0.224), S1_3.3_timing_Page_Submit = c(NA,
NA, NA, 0.158), S1_3.4_timing_Page_Submit = c(NA, NA, NA,
0\.2), S1_3.5_timing_Page_Submit = c(NA, NA, NA, 0.159), S1_3.6_timing_Page_Submit = c(NA,
NA, NA, 0.695), S1_3.7_timing_Page_Submit = c(NA, NA, NA,
0\.263), S1_3.8_timing_Page_Submit = c(NA, NA, NA, 0.267),
S1_3.9_timing_Page_Submit = c(NA, NA, NA, 0.136), S1_3.10_timing_Page_Submit = c(NA,
NA, NA, 0.216), S1_3.11_timing_Page_Submit = c(NA, NA, NA,
0\.249), S1_3.12_timing_Page_Submit = c(NA, NA, NA, 3.127),
ErrorNum_S1_1.1 = c(1, 0, 0, 1), ErrorNum_S1_1.2a = c(0,
0, 0, 1), ErrorNum_S1_1.2b = c(0, 1, 0, 2), ErrorNum_S1_1.3 = c(0,
0, 0, 0), ErrorNum_S1_2.1 = c(13, 4, 1, 1), ErrorNum_S1_2.2 = c(NA,
0, 0, 5), ErrorNum_S1_2.3 = c(NA, 7, 0, NA), ErrorNum_S1_2.4a = c(NA,
NA, 0, NA), ErrorNum_S1_2.4b = c(NA, NA, 3, NA), ErrorNum_S1_2.5 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_2.6 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.1 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.2 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.3 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.4 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.5 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.6 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.7 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.8 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.9 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.10 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.11 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_), ErrorNum_S1_3.12 = c(NA_real\_,
NA_real\_, NA_real\_, NA_real\_)), class = c("rowwise_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L), groups = structure(list(
.rows = structure(list(1L, 2L, 3L, 4L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame")))


library(haven)
library(ggplot2)
library(dplyr)
library(Hmisc)
library(tidyverse)

#Method A
Data_for_analysis_A \<- Data_for_analysis %\>%
rowwise() %\>%
mutate(S1_TotalTime = sum(S1_1.1_timing_Page_Submit,
S1_1.2a_timing_Page_Submit,
S1_1.2b_timing_Page_Submit,
S1_1.3_timing_Page_Submit,
S1_2.1_timing_Page_Submit,
S1_2.2_timing_Page_Submit,
S1_2.3_timing_Page_Submit,
S1_2.4a_timing_Page_Submit,
S1_2.4b_timing_Page_Submit,
S1_2.5_timing_Page_Submit,
S1_2.6_timing_Page_Submit,
S1_3.1_timing_Page_Submit,
S1_3.2_timing_Page_Submit,
S1_3.3_timing_Page_Submit,
S1_3.4_timing_Page_Submit,
S1_3.5_timing_Page_Submit,
S1_3.6_timing_Page_Submit,
S1_3.7_timing_Page_Submit,
S1_3.8_timing_Page_Submit,
S1_3.9_timing_Page_Submit,
S1_3.10_timing_Page_Submit,
S1_3.11_timing_Page_Submit,
S1_3.12_timing_Page_Submit, na.rm=TRUE))


Data_for_analysis_B \<- Data_for_analysis %\>%
rowwise() %\>%
mutate(S1_TotalTime = sum(
ifelse(!is.na(ErrorNum_S1_1.1)  && !is.na(S1_1.1_timing_Page_Submit) , S1_1.1_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_1.2a) && !is.na(S1_1.2a_timing_Page_Submit), S1_1.2a_timing_Page_Submit, NA),
ifelse(!is.na(ErrorNum_S1_1.2b) && !is.na(S1_1.2b_timing_Page_Submit), S1_1.2b_timing_Page_Submit, NA),
ifelse(!is.na(ErrorNum_S1_1.3)  && !is.na(S1_1.3_timing_Page_Submit) , S1_1.3_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_2.1)  && !is.na(S1_2.1_timing_Page_Submit) , S1_2.1_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_2.2)  && !is.na(S1_2.2_timing_Page_Submit) , S1_2.2_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_2.3)  && !is.na(S1_2.3_timing_Page_Submit) , S1_2.3_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_2.4a) && !is.na(S1_2.4a_timing_Page_Submit), S1_2.4a_timing_Page_Submit, NA),
ifelse(!is.na(ErrorNum_S1_2.4b) && !is.na(S1_2.4b_timing_Page_Submit), S1_2.4b_timing_Page_Submit, NA),
ifelse(!is.na(ErrorNum_S1_2.5)  && !is.na(S1_2.5_timing_Page_Submit) , S1_2.5_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_2.6)  && !is.na(S1_2.6_timing_Page_Submit) , S1_2.6_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.1)  && !is.na(S1_3.1_timing_Page_Submit) , S1_3.1_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.2)  && !is.na(S1_3.2_timing_Page_Submit) , S1_3.2_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.3)  && !is.na(S1_3.3_timing_Page_Submit) , S1_3.3_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.4)  && !is.na(S1_3.4_timing_Page_Submit) , S1_3.4_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.5)  && !is.na(S1_3.5_timing_Page_Submit) , S1_3.5_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.6)  && !is.na(S1_3.6_timing_Page_Submit) , S1_3.6_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.7)  && !is.na(S1_3.7_timing_Page_Submit) , S1_3.7_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.8)  && !is.na(S1_3.8_timing_Page_Submit) , S1_3.8_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.9)  && !is.na(S1_3.9_timing_Page_Submit) , S1_3.9_timing_Page_Submit,  NA),
ifelse(!is.na(ErrorNum_S1_3.10) && !is.na(S1_3.10_timing_Page_Submit), S1_3.10_timing_Page_Submit, NA),
ifelse(!is.na(ErrorNum_S1_3.11) && !is.na(S1_3.11_timing_Page_Submit), S1_3.11_timing_Page_Submit, NA),
ifelse(!is.na(ErrorNum_S1_3.12) && !is.na(S1_3.12_timing_Page_Submit), S1_3.12_timing_Page_Submit, NA), rm.na=TRUE))


#Method C (same as B but using 0 instead of NA)
Data_for_analysis_C \<- Data_for_analysis %\>%
rowwise() %\>%
mutate(S1_TotalTime = sum(
ifelse(!is.na(ErrorNum_S1_1.1)  && !is.na(S1_1.1_timing_Page_Submit) , S1_1.1_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_1.2a) && !is.na(S1_1.2a_timing_Page_Submit), S1_1.2a_timing_Page_Submit, 0),
ifelse(!is.na(ErrorNum_S1_1.2b) && !is.na(S1_1.2b_timing_Page_Submit), S1_1.2b_timing_Page_Submit, 0),
ifelse(!is.na(ErrorNum_S1_1.3)  && !is.na(S1_1.3_timing_Page_Submit) , S1_1.3_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_2.1)  && !is.na(S1_2.1_timing_Page_Submit) , S1_2.1_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_2.2)  && !is.na(S1_2.2_timing_Page_Submit) , S1_2.2_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_2.3)  && !is.na(S1_2.3_timing_Page_Submit) , S1_2.3_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_2.4a) && !is.na(S1_2.4a_timing_Page_Submit), S1_2.4a_timing_Page_Submit, 0),
ifelse(!is.na(ErrorNum_S1_2.4b) && !is.na(S1_2.4b_timing_Page_Submit), S1_2.4b_timing_Page_Submit, 0),
ifelse(!is.na(ErrorNum_S1_2.5)  && !is.na(S1_2.5_timing_Page_Submit) , S1_2.5_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_2.6)  && !is.na(S1_2.6_timing_Page_Submit) , S1_2.6_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.1)  && !is.na(S1_3.1_timing_Page_Submit) , S1_3.1_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.2)  && !is.na(S1_3.2_timing_Page_Submit) , S1_3.2_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.3)  && !is.na(S1_3.3_timing_Page_Submit) , S1_3.3_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.4)  && !is.na(S1_3.4_timing_Page_Submit) , S1_3.4_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.5)  && !is.na(S1_3.5_timing_Page_Submit) , S1_3.5_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.6)  && !is.na(S1_3.6_timing_Page_Submit) , S1_3.6_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.7)  && !is.na(S1_3.7_timing_Page_Submit) , S1_3.7_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.8)  && !is.na(S1_3.8_timing_Page_Submit) , S1_3.8_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.9)  && !is.na(S1_3.9_timing_Page_Submit) , S1_3.9_timing_Page_Submit,  0),
ifelse(!is.na(ErrorNum_S1_3.10) && !is.na(S1_3.10_timing_Page_Submit), S1_3.10_timing_Page_Submit, 0),
ifelse(!is.na(ErrorNum_S1_3.11) && !is.na(S1_3.11_timing_Page_Submit), S1_3.11_timing_Page_Submit, 0),
ifelse(!is.na(ErrorNum_S1_3.12) && !is.na(S1_3.12_timing_Page_Submit), S1_3.12_timing_Page_Submit, 0), rm.na=TRUE))
  • I might be missing something, but wouldn't including the `ErrorNum` columns in Methods B and C mean you'd get different results? For example, the 59.0 in S1_2.3_timing_Page_Submit ( `Data_for_analysis %>% select(S1_2.3_timing_Page_Submit, ErrorNum_S1_2.3)` ) would be included in Method A (as it'd ignore the fact `ErrorNum_S1_2.3` is NA for that row), but be excluded from Methods B and C. – Hobo Apr 06 '23 at 01:45
  • That's a great thought - I assumed the ErrorNum variable checking, to make sure it wasn't NA, wouldn't make much of a difference as only a few lines might be affected. It might explain the 4 that are quite different when using method A vs B. But what it doesn't explain is why 37/41 rows are off by exactly 1 and especially not +1. If anything, the ErrorNum checking should exclude some data from the sum, making it less than the Method A total, not more. (The Method A sum is correct as checked by calculating in SPSS). – Anne McLaughlin Apr 06 '23 at 02:42
  • 1
    I was at work the other day, so didn't have a proper dig. The upshot is you're using `rm.na=TRUE`, when it should be `na.rm=TRUE`. So the extra 1 is coming from `TRUE` being treated as an extra data point, which is why it's off by 1. Seems to work properly (including method B) if you fix that. – Hobo Apr 07 '23 at 06:41
  • Thank you. Dyslexia is often the core issue. – Anne McLaughlin May 04 '23 at 14:50

1 Answers1

1

Perhaps something like this?

Data_for_analysis %>%
  rowwise() %>%
  mutate(sum = if_else(
    sum(is.na(c_across(starts_with("S1")))) == 23, NA,
    sum(c_across(starts_with("S1")), na.rm = TRUE)))
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • Thank you. This solved the problem. I still don't understand why Method C adds exactly 1 to the sum, but now I can move on. This was the final code I used across my dataset: `Data_for_analysis <- Data_for_analysis %>% rowwise() %>% mutate(sum = if_else(sum(is.na(c_across(starts_with("S1") & ends_with("timing_Page_Submit")))) == 23, NA, sum(c_across(starts_with("S1") & ends_with("timing_Page_Submit")), na.rm = TRUE)))` – Anne McLaughlin Apr 06 '23 at 12:26