1

I am trying to create a new column using existing column names within my dataset. Specifically, in this dataset, one row corresponds to one respiratory specimen and has ~10 virus test results, each with a corresponding column; a value of 0 indicates testing negative for this virus, while a value of 1 indicates testing positive for this virus. A simplified version of the dataset would look something like this:

Specimen ID FLUA FLUB RSV HMPV
1 1 0 1 0
2 0 1 1 0
3 0 1 1 1

I would like to create a new column called "virus_combinations" that lists out the viruses for which this specimen tested positive. I would like the combination of viruses to be listed in alphabetical order so that R will recognize the same combinations of viruses as being the same value (e.g., FLUA + RSV is the same combination as RSV + FLUA, so I want them listed in the same order).

Specimen ID FLUA FLUB RSV HMPV virus_combinations
1 1 0 1 0 FLUA, RSV
2 0 1 1 0 FLUB, RSV
3 0 1 1 1 FLUB, HMPV, RSV

As I mentioned, there are ~10 viruses I'm interested in and the number of viruses for which specimens tested positive ranges from 1-5. Therefore, it would be difficult to manually add all of the possible virus combinations with relevant test results.

I saw a related post on Stack Exchange, in which the solution involved the following code:

df1 <- df %>%
  group_by(ID) %>%
  summarise(output = paste(name[value == 1], collapse = ','))

I have tried adapting this code by removing the group_by() and summarise() commands and instead using it with mutate(), but I'm not sure if that's something I can actually do. I also don't understand how to specify which columns I want the new 'virus_combinations' variable to pull from.

df1 <- df %>%
  mutate(virus_combinations = paste(names[value == 1], collapse = ','))

Any relevant commands or packages that you can think of would be much appreciated. Thank you in advance!

r2evans
  • 141,215
  • 6
  • 77
  • 149
jcijac89
  • 11
  • 1
  • Your desired output looks like `mutate` but your first code attempt is `summarize`, do you want one row per ID or all original rows with `virus_combinations` repeated within each ID group? – r2evans Jul 05 '23 at 18:21
  • 1
    Thanks for your comment! Each specimen ID is unique, so there are no ID groups. I would like one row per ID with the relevant virus_combinations for each row. – jcijac89 Jul 05 '23 at 18:28

2 Answers2

1

in base R:

 cbind(df1, v = apply(df[-1], 1, \(x)toString(names(df1)[-1][x==1])))
  Specimen.ID FLUA FLUB RSV HMPV               v
1           1    1    0   1    0       FLUA, RSV
2           2    0    1   1    0       FLUB, RSV
3           3    0    1   1    1 FLUB, RSV, HMPV

aggregate(cbind(virus_combination = as.character(ind)) ~ Specimen.ID, 
   cbind(df1['Specimen.ID'], stack(df1, -Specimen.ID)),
   toString, subset = values > 0 ) |>
   merge(df1, y = _)

 Specimen.ID FLUA FLUB RSV HMPV virus_combination
1           1    1    0   1    0         FLUA, RSV
2           2    0    1   1    0         FLUB, RSV
3           3    0    1   1    1   FLUB, RSV, HMPV

a <- as.matrix(df[-1])
cbind(df, v = tapply(names(df)[(a * col(a)) + 1], a*row(a), toString)[-1])

  Specimen_ID FLUA FLUB RSV HMPV               v
1           1    1    0   1    0       FLUA, RSV
2           2    0    1   1    0       FLUB, RSV
3           3    0    1   1    1 FLUB, RSV, HMPV
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

Another dplyr alternative:

quux %>%
  group_by(`Specimen ID`) %>%
  mutate(
    virii = paste(sort(
        apply(pick(FLUA:HMPV), 1, \(z) names(which(z == 1)))
      ), collapse = ", ")
  ) %>%
  ungroup()
# # A tibble: 3 × 6
#   `Specimen ID`  FLUA  FLUB   RSV  HMPV virii          
#           <int> <int> <int> <int> <int> <chr>          
# 1             1     1     0     1     0 FLUA, RSV      
# 2             2     0     1     1     0 FLUB, RSV      
# 3             3     0     1     1     1 FLUB, HMPV, RSV

Data

quux <- structure(list("Specimen ID" = 1:3, FLUA = c(1L, 0L, 0L), FLUB = c(0L, 1L, 1L), RSV = c(1L, 1L, 1L), HMPV = c(0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -3L))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • apply within mutate? Maybe use rowwise somehow? – zx8754 Jul 05 '23 at 18:48
  • I found `apply` to be an easier way to find the names of the selected columns, `rowwise` does not enable that directly @zx8754. I'm likely missing something. Recommendations? – r2evans Jul 05 '23 at 19:02
  • See linked post, looks like rowwise works - https://stackoverflow.com/a/69357873/680068 (I don't know dplyr much, thought you might know more :) ) – zx8754 Jul 05 '23 at 19:04
  • 1
    `cur_data()` is deprecated ... but you've identified the question of which this is a duplicate (and closed as such, nice) – r2evans Jul 05 '23 at 19:05
  • I see, maybe comment at Stéphane Laurent answer? – zx8754 Jul 05 '23 at 19:06