0

I would like to make a some new variables in R based on multiple (>100) other variables.

My dataset looks like this

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br
1      1          1          1          2          2          1         6         0         1         6         1
2      2          2                                3          2         5                             1         3
3      3          0          0       <NA>          4          1         0         0      <NA>         2         2
4      4         NA                     1          2          2        NA                   1         1         4
5      5         NA          4          2          3          5        NA         4         3         4         3

The variables diag_x_ais can take integers from 0-6, and diag_x_br can take integers between 1-6. I would like to make 6 new variables corresponding to the 6 possible diag_x_br values, i.e. the new variables would be called br_1, br_2 ... br_6. These new variables shall then be filled with the maximum value of the corresponding diag_x_ais variables, i.e. if diag_1_br, diag_2_br, and diag_4_br are all 3, then br_3 should take the maximum value of diag_1_ais, diag_2_ais, and diag_4_ais.

Please also see the example dataset below:

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6
1      1          1          1          2          2          1         6         0         1         6         1    2   NA   NA   NA   NA    2
2      2          2          1          4          3          5         5         2         2         1         3    3    4    5   NA    2   NA
3      3          0          0         NA          4          1         0         0        NA         2         2   NA    4   NA   NA   NA   NA
4      4         NA                     1          2          2        NA                   1         1         4    2   NA   NA    2   NA   NA
5      5         NA          4          2          3          5        NA         4         3         4         3   NA   NA    5    4   NA   NA

Hereafter, I would like a final variable which calculates the sum of the up to three largest br_x variables, example displayed below:

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 sum3
1      1          1          1          2          2          1         6         0         1         6         1    2   NA   NA   NA   NA    2    4
2      2          2          1          4          3          5         5         2         2         1         3    3    4    5   NA    2   NA   12
3      3          0          0         NA          4          1         0         0        NA         2         2   NA    4   NA   NA   NA   NA    4
4      4         NA                     1          2          2        NA                   1         1         4    2   NA   NA    2   NA   NA    4
5      5         NA          4          2          3          5        NA         4         3         4         3   NA   NA    5    4   NA   NA    9

My actual dataset has 60 diag_x_ais variables and 60 diag_x_br variables and 4000 rows.

I hope that someone can help me do this in R. Thank you!

trine
  • 5
  • 1
  • 1
    The relationship between `diag_1_ais` and `diag_2_br` and how to calculate your `br_x` is a bit unclear, could you please elaborate on this? – Anoushiravan R Aug 13 '21 at 09:23
  • 1
    Ofc. I will break the answer into two parts as I reach the character limit. diag_1_ais is related to diag_1_br, diag_2_ais is related to diag_2_br and so on (there are 60 of these "couples" in total). I would then like to create six new br_ variables from br_1 to br_6. The br_ variables are named after the values that diag_x_br can take, which are integers 1-6. If we start with br_1 then I would like this to "search" for 1s in all the diag_x_br variables (i.e. 60 variables) and plug in the maximum value of the corresponding diag_x_ais variable. – trine Aug 13 '21 at 09:47
  • 1
    So let's say diag_3_br, diag_4_br and diag_7_br all take the value 1, and their corresponding diag_3_ais, diag_4_ais and diag_7_ais take the values 1, 4, 2, respectively. Then br_1 should take the value 4, as this is the maximum value of the corresponding diag_x_ais values. Does this make sense? – trine Aug 13 '21 at 09:47
  • I got your point. – Anoushiravan R Aug 13 '21 at 10:05

3 Answers3

2

I think you could use the following solution. I made a slight modification so that we only sum the first 3 max values:

library(dplyr)
library(purrr)

df %>%
  bind_cols(as.data.frame(t(map_dfr(1:6, function(a) pmap_dfc(df, ~ {x <- c(...)[grepl("br", names(df))]
  inds <- which(x == a) 
  if(length(inds) != 0) {
    y <- c(...)[grepl("ais", names(df))]
    max(y[inds])
  } else {
    NA
  }})))) %>%
    setNames(paste0("br", 1:6))) %>% 
  rowwise() %>%
  mutate(sum = sum(sort(as.numeric(c_across(starts_with("br"))), decreasing = TRUE)[1:3], na.rm = TRUE)) %>%
  select(starts_with("br"), sum)

Resulting output

# A tibble: 5 x 7
# Rowwise: 
  br1   br2   br3   br4   br5   br6     sum
  <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2     NA    NA    NA    NA    2         4
2 3     4     5     NA    2     NA       12
3 NA    4     NA    NA    NA    NA        4
4 2     NA    NA    2     NA    NA        4
5 NA    NA    5     4     NA    NA        9
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 2
    Thank you all for helping me out here. I ended up using OneTrickDragon's solution as it was easier for me to understand, but I really appreciate your help and all three answers. Thanks again! – trine Aug 17 '21 at 09:47
2

You could use some heavy data.transforming most likely not very efficient on large datasets. There are some empty values, NA and 0 in your dataset. I didn't handle them (and replaced the empty values by NA to make importing easier).

library(tidyr)
library(dplyr)

data %>% 
  pivot_longer(-sub_id, 
               names_to = c("name", "cat"), 
               names_pattern = ".*_(\\d+)_(.*)") %>% 
  pivot_wider(names_from = "cat") %>% 
  group_by(sub_id, br) %>% 
  summarise(value = max(ais), .groups = "drop") %>% 
  filter(br %in% 1:6) %>%
  group_by(sub_id) %>% 
  mutate(sum = sum(tail(sort(value), 3))) %>% 
  pivot_wider(names_from = br,
              names_glue = "br_{br}") %>% 
  select(sub_id, paste0("br_", 1:6), sum)

This returns

# A tibble: 5 x 8
  sub_id  br_1  br_2  br_3  br_4  br_5  br_6   sum
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1     2    NA    NA    NA    NA     2     4
2      2     3     4     5    NA     2    NA    12
3      3    NA     4    NA    NA    NA    NA     4
4      4     2    NA    NA     2    NA    NA     4
5      5    NA    NA     5     4    NA    NA     9

Piping an addtional right_join(data, by = "sub_id") gives you your example output (minus the order of your columns). I took an idea from this answer.

Data

data <- structure(list(sub_id = c(1, 2, 3, 4, 5), diag_1_ais = c(1, 2, 
0, NA, NA), diag_2_ais = c(1, 1, 0, NA, 4), diag_3_ais = c(2, 
4, NA, 1, 2), diag_4_ais = c(2, 3, 4, 2, 3), diag_5_ais = c(1, 
5, 1, 2, 5), diag_1_br = c(6, 5, 0, NA, NA), diag_2_br = c(0, 
2, 0, NA, 4), diag_3_br = c(1, 2, NA, 1, 3), diag_4_br = c(6, 
1, 2, 1, 4), diag_5_br = c(1, 3, 2, 4, 3)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))
halfer
  • 19,824
  • 17
  • 99
  • 186
Martin Gal
  • 16,640
  • 5
  • 21
  • 39
  • 1
    Nice solution. There were some discrepancies between the data sets and I was at my wit's end on why I couldn't get the desired results. – Anoushiravan R Aug 13 '21 at 11:34
  • 1
    @AnoushiravanR I missed the "sum the top 3 part". Still struggling with that one. The result of row 2 should be `12` instead of `14`. – Martin Gal Aug 13 '21 at 11:45
  • 1
    Finally fixed it. – Martin Gal Aug 13 '21 at 11:52
  • Great! That `tail` function is so helpful. If I could I would totally let go of `tidyverse` in favor of base R. – Anoushiravan R Aug 13 '21 at 11:53
  • @Martin Gal, Thank you all for helping me out here. I ended up using OneTrickDragon's solution as it was easier for me to understand, but I really appreciate your help and all three answers. Thanks again! – trine Aug 17 '21 at 09:46
1

For the first part:

data <- data.frame(sub_id = c(1,2,3,4,5),
                 diag_1_ais  = c(1,2,0,NA,NA),
                 diag_2_ais  = c(1,1,0,NA,4),
                 diag_3_ais  = c(2,4,NA,1,2),
                 diag_4_ais  = c(2,3,4,2,3),
                 diag_5_ais  = c(1,5,1,2,5),
                 diag_1_br = c(6,5,0,NA,NA),
                 diag_2_br = c(0,2,0,NA,4),
                 diag_3_br = c(1,2,NA,1,3),
                 diag_4_br = c(6,1,2,1,4),
                 diag_5_br = c(1,3,2,4,3))

calc_br <- function(data, value, firstBr, lastBr) {
    br <- c()
    for (i in 1:nrow(data)){
        if (length(which(data[i,c(firstBr:lastBr)] %in% value))!=0){
          br <- c(br, c(max(data[i,which(data[i,c(firstBr:lastBr)] %in% value)+1]))) 
        }
        else {
            br <- c(br, c(NA))
        }
    }
    result <- br
}

firstBr = 7
lastBr = 11

data$br_1 <- calc_br(data,1,firstBr,lastBr)
data$br_2 <- calc_br(data,2,firstBr,lastBr)
data$br_3 <- calc_br(data,3,firstBr,lastBr)
data$br_4 <- calc_br(data,4,firstBr,lastBr)
data$br_5 <- calc_br(data,5,firstBr,lastBr)
data$br_6 <- calc_br(data,6,firstBr,lastBr)

This should yield the same results as in your example. You should only have to exchange lastBr and firstBr (to 62 and 122 i would guess).

For the second part this should do the trick:

br_sum <- c()
for (i in 1:nrow(data)){
    br_sum <- c(br_sum, sum(data[i,lastBr+tail(order(data[i,c((lastBr+1):(lastBr+6))], na.last = NA), 3)]))
}
data$br_sum <- br_sum

For completness here my results:

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br
1      1          1          1          2          2          1         6
2      2          2          1          4          3          5         5
3      3          0          0         NA          4          1         0
4      4         NA         NA          1          2          2        NA
5      5         NA          4          2          3          5        NA
  diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 br_sum
1         0         1         6         1    2   NA   NA   NA   NA    2      4
2         2         2         1         3    3    4    5   NA    2   NA     12
3         0        NA         2         2   NA    4   NA   NA   NA   NA      4
4        NA         1         1         4    2   NA   NA    2   NA   NA      4
5         4         3         4         3   NA   NA    5    4   NA   NA      9
  • 1
    We only need to sum first max 3 values among `br_` variables. That's why in row2 the sum should be 12 instead of 14 :) – Anoushiravan R Aug 13 '21 at 12:00
  • 1
    Oh, snap i have overlooked that requirement. Thanks for the comment, I'll fix that! – OneTrickDragon Aug 13 '21 at 12:11
  • 1
    Forget to mention that I edited it and it should be working now. – OneTrickDragon Aug 14 '21 at 03:56
  • 1
    Thank you so much! I ended up using this answer as it was easier for me to understand than the two other answers, but I really appreciates everyone's effort. – trine Aug 17 '21 at 09:41
  • Happy to be helpful, I did not comment the code much but glad that it was understandable! But just keep in mind that my solution could be slower than the others. Since the loops could take time on larger datasets. – OneTrickDragon Aug 17 '21 at 11:26