0

I have a middle sized dataset (> 20 million observations) and I want to use a shiny app to filter the desired information about specific subgroups in the dataset.

Right now, I want to assign for each observation a subgroup number (1 = subgroup1, 2 = subgroup2 and so forth). However, before the desired subgroups are determined and submitted via the shiny app interface, the total number of subgroups is unknown.

Therefore, I create a global variable that stores the number of subgroups after the information is submitted. With that information I would like to combine a for loop and the ifelse function to create a column in the dataset that stores the respective subgroup for each observation.

Imagine the case with the following sample data: Overall, the dataset contains three columns. The first column contains the unique id of each observation. The second column is titled "insurance" and contains a name of insurance companies. The third column is titled "response_code" and contains numbers between 1 and 5 (as a string). This column tells us how the observation rated its satisfaction with the insurance company.

As an example, the dataset might look like this:

ID insurance response_code
1 Insurance_Company_1 "1"
2 Insurance_Company_2 "2"
3 Insurance_Company_4 "3"
4 Insurance_Company_3 "1"
5 Insurance_Company_6 "2"
6 Insurance_Company_2 "4"
7 Insurance_Company_1 "5"
8 Insurance_Company_4 "1"

Code:

df <- data.frame(ID = c(1,2,3,4,5,6,7,8), 
                 insurance = c("Insurance_Company_1", "Insurance_Company_2", 
                               "Insurance_Company_4", "Insurance_Company_3",
                               "Insurance_Company_6", "Insurance_Company_2",
                               "Insurance_Company_1", "Insurance_Company_4"),
                 response_code = c("1", "2", "3", "1", "2", "4", "5", "1"))

Besides, I have a few variables with characters that contain the name of certain insurance companies and certain response codes. These variables look like this:

labels_subgroup_1 <- c("Insurance_Company_1", "Insurance_Company_2")

labels_subgroup_2 <- c("Insurance_Company_3")

labels_subgroup_3 <- c("Insurance_Company_4", "Insurance_Company_5", "Insurance_Company_6").

response_codes_subgroup_1 <- c("1")

response_codes_subgroup_2 <- c("1", "2")

response_codes_subgroup_3 <- c("2", "3")

Last but not least, I have a variable that stores the total number of the desired subgroups:

number_of_subgroups <- 3 

(This number is determined as soon as the data of the shiny app interface are submitted)

As a final result, I want to test for each observation: If the name of the insurance company is contained in one of the aforementioned labels_subgroup variables (e.g. Insurance_Company_2) AND the response_code is contained in the respective reponse_codes_subgroup variable (e.g. in the case of Insurance_Company_2 -> "2" or "3"), THEN the respective subgroup number is assigned to the observation. Otherwise, the subgroup number remains zero.

In the example, the desired result should look like this:

ID insurance response_code subgroup_number
1 Insurance_Company_1 "1" 1
2 Insurance_Company_2 "2" 0
3 Insurance_Company_4 "3" 3
4 Insurance_Company_3 "1" 2
5 Insurance_Company_6 "2" 3
6 Insurance_Company_2 "4" 0
7 Insurance_Company_1 "5" 0
8 Insurance_Company_4 "1" 0

I already found a solution with two nested for-loops, however this approach is way too slow for this amount of data:

df$subgroup_number <- 0
for(i in 1:nrow(df)) {
  for(j in 1:number_of_subgroups) {
    if(df$subgroup_number[i] == 0 && 
       df$insurance[i] %in% eval(parse(text = paste0("labels_subgroup_", j))) && 
       df$response_code[i] %in% eval(parse(text = paste0("response_codes_subgroup_", j)))) {
       df$subgroup_number[i] <- j
    }
  }
}

Consequently, I tried to implement an approach with mutate and ifelse() within a for loop. Unfortunately, only the last iteration of the loop is saved in the data:

df$subgroup_number <- 0
for(i in 1:number_of_subgroups) {
  df <- df %>% 
    mutate(subgroup_number = ifelse(
    subgroup_number == 0 &
    insurance %in% eval(parse(text = paste0("labels_subgroup_", i))) &
    response_code %in% eval(parse(text = paste0("response_codes_subgroup_", i))), 
    i, 0))
}

Result:

ID insurance response_code subgroup_number
1 Insurance_Company_1 "1" 0
2 Insurance_Company_2 "2" 0
3 Insurance_Company_4 "3" 3
4 Insurance_Company_3 "1" 0
5 Insurance_Company_6 "2" 3
6 Insurance_Company_2 "4" 0
7 Insurance_Company_1 "5" 0
8 Insurance_Company_4 "1" 0

Is there a solution which saves the results of all iterations in the dataframe? Or is there a possible approach with the case_when function that might work within a for-loop? Thanks in advanace for your help.

SG_95
  • 13
  • 4
  • You don't need to loop over the rows as `ifelse/case_when` are vectorized. i.e. `df %>% mutate(subgroup_number = case_when(insurance %in% labels_list[[1]] & response_code %in% response_list[[1]]~ 1, insurance %in% labels_list[[2]] & response_code %in% response_list[[2]]~ 2,insurance %in% labels_list[[3]] & response_code %in% response_list[[3]]~ 3 ))` where `labels_list <- mget(ls(pattern = "labels_subgroup")); response_list <- mget(ls(pattern = "response_codes"))` – akrun Sep 21 '22 at 16:41
  • Thank you very much for your answer. Unfortunately, your suggestion seems to be hardcoded. I am looking for a solution that is "flexible" since the number of subgroups is a variable that depends on the input of the user. – SG_95 Sep 22 '22 at 07:20

1 Answers1

0

For everyone who is still interested in the solution, I used the following approach:

df$subgroup_number <- 0
subgroup_number_list <- vector(mode = "list", length = number_of_subgroups)
for(i in 1:number_of_subgroups) {
  subgroup_number_list[[i]] = ifelse(
  df$insurance %in% eval(parse(text = paste0("labels_subgroup_", i))) &
  df$response_code %in% eval(parse(text = paste0("response_codes_subgroup_", i))), 
  i, 0)
}

subgroup_number_df <- mapply(c, subgroup_number_list)

subgroup_number_vec <- rowSums(subgroup_number_df )

df$subgroup_number <- subgroup_number_vec

There might be a better or cleaner solution, but I am more than satisfied with the performance of my approach.

SG_95
  • 13
  • 4