Vectorized recoding of rows in R data frame based on value in another column

Question

I have a dataset of binary responses (0, 1) to a number of questions like below. Each row indicates an individual and each column is a response to a question. "completed" indicates how many questions were reached. For example, if completed = 2 only q_1 and q_2 were responded too by that individual. I want to rescore each "q" column such than any column number greater than "completed" column is a 0, otherwise use the value from the corresponding "q" column.

have = data.frame(q_1 = c(1,0,1,1,0),
                      q_2 = c(1,1,1,1,0),
                      q_3 = c(0,0,1,1,0),
                      q_4 = c(1,0,0,1,1),
                      q_5 = c(1,0,0,0,1),
                      completed = c(2, 3, 2, 4, 1))

> have
  q_1 q_2 q_3 q_4 q_5 completed
1   1   1   0   1   1         2
2   0   1   0   0   0         3
3   1   1   1   0   0         2
4   1   1   1   1   0         4
5   0   0   0   1   1         1

How can I get to this output? Would it be easier to transform the dataset?

> want
  q_1 q_2 q_3 q_4 q_5 completed scored_1 scored_2 scored_3 scored_4 scored_5
1   1   1   0   1   1         2        1        1        0        0        0
2   0   1   0   0   0         3        0        1        0        0        0
3   1   1   1   0   0         2        1        1        0        0        0
4   1   1   1   1   0         4        1        1        1        1        0
5   0   0   0   1   1         1        0        0        0        0        0

This code will get the correct output. However, my real dataset is very large so I would need to be able to loop through the columns.

want = have %>%
  mutate(scored_1 = ifelse(completed >= 1, q_1, 0),
        scored_2 = ifelse(completed >= 2, q_2, 0),
        scored_3 = ifelse(completed >= 3, q_3, 0),
        scored_4 = ifelse(completed >= 4, q_4, 0),
        scored_5 = ifelse(completed >= 5, q_5, 0))

I don't get it, if completed =3, shouldn't Q1, Q2 and Q3 be 1? I don't understand what you mean by "how many questions were reached" — eduardokapp, May 14 '21 at 17:18
If completed = 3 then 3 questions (q1, q2, and q3) were reached and their responses are indicated by the 0s and 1s in the columns. But since only 3 questions were reached, the response in q4 and q5 are invalid and I need them to be wiped to 0s. — Kate N, May 14 '21 at 17:29

Ray · Answer 1 · 2021-05-14T17:29:37.070

0

If you do not need any other logic, you can easily convert your completed score into conditions for populating your columns.

library(dplyr)

have %>% 
    mutate( scored_1 = ifelse(completed != 0,  q_1, 0)
            ,scored_2 = ifelse(completed >= 2, q_2, 0)
            ,scored_3 = ifelse(completed >= 3, q_3, 0)
            ,scored_4 = ifelse(completed >= 4, q_4, 0)
            ,scored_5 = ifelse(completed == 5, q_5, 0)
    )

edited May 14 '21 at 17:29

answered May 14 '21 at 16:18

Ray

2,008
14
21

I need the "scored_" columns to equal the corresponding "q_" column unless the question was not reach which is indicated by the "completed" column. – Kate N May 14 '21 at 17:15
Apologies Kate. Just work on the condition of the `ifelse()` to fit your conditions. I adapted the answer. – Ray May 14 '21 at 17:30
Yes that will work! I just will need to be able to loop through the columns because my actual dataset is much larger. – Kate N May 14 '21 at 17:33

eduardokapp · Answer 2 · 2021-05-14T18:12:07.147

I'm assuming if "completed" = N, then the individual actually wrote down the answers to the questions up to N. Right? Please correct me if I'm wrong.

If that's the case, I have a vectorized solution:

have = data.frame(q_1 = c(1,0,1,1,0),
                      q_2 = c(1,1,1,1,0),
                      q_3 = c(0,0,1,1,0),
                      q_4 = c(1,0,0,1,1),
                      q_5 = c(1,0,0,0,1),
                      completed = c(2, 3, 2, 4, 1))

check <- function(x) {
    # which column number is "completed"
    stop <- which(names(x) == "completed")
    start <- x[["completed"]]
    x <- as.numeric(x)
    x[(1:length(x) > start) & (1:length(x) < stop)] <- 0
    return(x)
}

want <- have
for (line in 1:nrow(want)) {
    want[line, ] <- check(want[line, ])
}

score 0 · Accepted Answer · answered May 14 '21 at 17:37

You may also try mutate(across(.. in dplyr. This way you can mutate as many columns as you want (e.g. where names start from q_)

have = data.frame(q_1 = c(1,0,1,1,0),
                  q_2 = c(1,1,1,1,0),
                  q_3 = c(0,0,1,1,0),
                  q_4 = c(1,0,0,1,1),
                  q_5 = c(1,0,0,0,1),
                  completed = c(2, 3, 2, 4, 1))
library(tidyverse)

have %>%
  mutate(across(starts_with('q_'), ~ifelse(completed >= as.numeric(str_remove(cur_column(), 'q_')), ., 0),
                .names = 'scored_{.col}')) %>%
  rename_with(~str_remove(., 'q_'), starts_with('scored'))
#>   q_1 q_2 q_3 q_4 q_5 completed scored_1 scored_2 scored_3 scored_4 scored_5
#> 1   1   1   0   1   1         2        1        1        0        0        0
#> 2   0   1   0   0   0         3        0        1        0        0        0
#> 3   1   1   1   0   0         2        1        1        0        0        0
#> 4   1   1   1   1   0         4        1        1        1        1        0
#> 5   0   0   0   1   1         1        0        0        0        0        0

^{Created on 2021-05-14 by the reprex package (v2.0.0)}

score 0 · Answer 4 · answered May 14 '21 at 19:25

0

Using col:

h = have[startsWith(names(have), "q")]
h[col(h) > have$completed] = 0
cbind(have, setNames(h, paste0("s_", 1:ncol(h))))

answered May 14 '21 at 19:25

Henrik

65,555
14
143
159

Vectorized recoding of rows in R data frame based on value in another column

4 Answers4