Conditional calculation of new variable based on specific groups of rows, row values and columns in R dataframe

Question

Example Data

I have got the following survey dataset in R and need help with the conditional calculation of a specific new variable.

# Load package
library(tidyverse)

# Important: set seed for replicability
set.seed(123)

# Create data: step 1
df <- tibble(
  country = c(rep("A", 10), rep("B", 10)),
  respondent_id = 1:20,
  vote_choice = c(sample(c("PartyA", "PartyB", "PartyC"), 10, replace = TRUE),
                  sample(c("PartyD", "PartyE", "PartyF"), 10, replace = TRUE)),
  ptv_1 = runif(20, min = 0, max = 1) %>% round(., 3),
  ptv_2 = runif(20, min = 0, max = 1) %>% round(., 3),
  ptv_3 = runif(20, min = 0, max = 1) %>% round(., 3)
)

# Create data: step 2
df <- df %>% 
  group_by(vote_choice, country) %>%
  summarize(across(starts_with("ptv"), \(x) mean(x, na.rm = TRUE))) %>%
  pivot_longer(cols = starts_with("ptv"), names_to = "party_to_ptv", values_to = "average_value") %>%
  group_by(vote_choice, country) %>%
  slice_max(order_by = average_value) %>%
  ungroup() %>%
  mutate(average_value = NULL) %>%
  right_join(., df, by = c("vote_choice", "country"))

# Inspect data
df

# A tibble: 20 × 7
   vote_choice country party_to_ptv respondent_id ptv_1 ptv_2 ptv_3
   <chr>       <chr>   <chr>                <int> <dbl> <dbl> <dbl>
 1 PartyA      A       ptv_2                   10 0.691 0.799 0.710 
 2 PartyB      A       ptv_3                    4 0.544 0.233 0.810 
 3 PartyB      A       ptv_3                    6 0.289 0.266 0.794
 4 PartyB      A       ptv_3                    7 0.147 0.858 0.440 
 5 PartyB      A       ptv_3                    8 0.963 0.046 0.754
 6 PartyC      A       ptv_1                    1 0.994 0.369 0.274
 7 PartyC      A       ptv_1                    2 0.656 0.152 0.815
 8 PartyC      A       ptv_1                    3 0.709 0.139 0.449
 9 PartyC      A       ptv_1                    5 0.594 0.466 0.812
10 PartyC      A       ptv_1                    9 0.902 0.442 0.629
11 PartyD      B       ptv_3                   13 0.478 0.207 0.220 
12 PartyD      B       ptv_3                   16 0.318 0.895 0.352
13 PartyD      B       ptv_3                   19 0.415 0.095 0.668
14 PartyD      B       ptv_3                   20 0.414 0.384 0.418
15 PartyE      B       ptv_1                   11 0.795 0.122 0.001
16 PartyE      B       ptv_1                   12 0.025 0.561 0.475
17 PartyE      B       ptv_1                   14 0.758 0.128 0.380 
18 PartyF      B       ptv_2                   15 0.216 0.753 0.613
19 PartyF      B       ptv_2                   17 0.232 0.374 0.111
20 PartyF      B       ptv_2                   18 0.143 0.665 0.244

Information on the variables:

country encompasses 2 countries in my example df which each contain 10 respondents and a set of 3 distinct political parties respondents got to choose between at the last election (the real data also contain a variable year which I did not include for the sake of simplicity)
respondent_id refers to the respondent in the survey dataset and demonstrates that the dataset is at respondent-level but can otherwise be ignored
vote_choice denotes by name the party the respondent voted for at the last election
ptv_1, ptv_2, and ptv_3 indicate for each party that is available the leaning of each respondent to this party (in the real data, respondents of course lean more strongly to the party they voted for); scale: 0-1
party_to_ptv is a conversion list that indicates which party in vote_choice corresponds to which ptv_* column

Problem Description

I now need to calculate a set of (3) new variables called electoral_opportunites_* where the asterisk is a placeholder for 1-3 refering to the three PTVs. The idea is to calculate the changes parties have of gaining new voters based on the favorable leaning of other parties' voters.

To do so, I need to calculate: 1 - (sqrt(PTV of party voted for) - sqrt(PTV of other party)), the idea of which is to set the strength of support of one's own party in relation to a new party. For example, if a respondent strongly supports their own party, A, by PTV = 1.0, it doesn't really matter that much that they also lean to B by PTV = 0.4.

My problem with the calculation is the conditionality: I need to find rowwise for each respondent the PTV column value that corresponds to their party of choice (which may not be the highest PTV value in the row), and then subtract from it the square-rooted value of another column.

Manually, I would do it as follows for the example df.

Expected Outcome (for electoral_opportunities_1)

df %>% 
  mutate(electoral_potential_1 = 
           # Subtract: PTV (party voted for) - PTV (PTV column 1)...
           c(1 - ( sqrt(0.799) - sqrt(0.691) ),
             1 - ( sqrt(0.810) - sqrt(0.544) ),
             1 - ( sqrt(0.794) - sqrt(0.289) ),
             1 - ( sqrt(0.440) - sqrt(0.147) ),
             1 - ( sqrt(0.754) - sqrt(0.963) ),
             NA, # ...unless they are both the same.
             NA,
             NA,
             NA,
             NA,
             1 - ( sqrt(0.220) - sqrt(0.478) ),
             1 - ( sqrt(0.352) - sqrt(0.318) ),
             1 - ( sqrt(0.668) - sqrt(0.415) ),
             1 - ( sqrt(0.418) - sqrt(0.414) ),
             NA,
             NA,
             NA,
             1 - ( sqrt(0.753) - sqrt(0.216) ),
             1 - ( sqrt(0.374) - sqrt(0.232) ),
             1 - ( sqrt(0.665) - sqrt(0.143) )) ) -> df

df

As a minor detail, I would afterwards check if there are any values > 1 and cap them at 1, which means that if respondents are leaning more strongly to a party they did not actually vote for, said party will receive the highest score (1) in terms of its eletoral changes to pursuade that voter.

df %>% 
  mutate(electoral_opportunities_1 = ifelse(electoral_opportunities_1 > 1, 1, electoral_opportunities_1)) -> df

I cannot do all of this by hand. Hence I would be grateful for an efficient and tidy solution to calculate the electoral opportunities for individual PTV columns. I have tried many different approaches, including pivoting the df, none of which have worked so far. Taken together, the process is:

Take the value of the PTV column that corresponds to vote_choice.
Subtract from this value the value of a given PTV column to calculate that party's electoral opportunities.
Unless both parties are the same, in which case set the value to NA.
Then, check if there are any values > 1 and cap them at 1.

EDIT

I just noticed that in the final df I would of course need the average electoral opportunity for each party in vote_choice, instead of three separate columns!

I got it to the 3 columns. So from those you need their average correct? Like rowwise average — Sotos, Jul 27 '23 at 12:45

Sotos · Answer 1 · 2023-07-27T13:05:47.337

Continuing form your solution,

df %>%
  mutate(
    ptv_v = case_when(
      party_to_ptv == "ptv_1" ~ ptv_1,
      party_to_ptv == "ptv_2" ~ ptv_2,
      party_to_ptv == "ptv_3" ~ ptv_3,
      TRUE ~ NA_real_
    ),
    opportunity_1 = ifelse(ptv_1 > ptv_v, 1, 1 - (sqrt(ptv_v) - sqrt(ptv_1))),
    opportunity_2 = ifelse(ptv_2 > ptv_v, 1, 1 - (sqrt(ptv_v) - sqrt(ptv_2))),
    opportunity_3 = ifelse(ptv_3 > ptv_v, 1, 1 - (sqrt(ptv_v) - sqrt(ptv_3))),
  ) %>%
  mutate_at(vars(starts_with("opportunity")), ~ifelse(party_to_ptv == substr(., start = 14, stop = 18), NA, .)) %>%
  group_by(vote_choice) %>%
  summarise(avg_opportunity = mean(c(opportunity_1, opportunity_2, opportunity_3), na.rm = TRUE))

which now gives,

vote_choice avg_opportunity
  <chr>                 <dbl>
1 PartyA                0.962
2 PartyB                0.813
3 PartyC                0.836
4 PartyD                0.937
5 PartyE                0.759
6 PartyF                0.816

Initial attempt

library(tidyverse) 

df %>%
  pivot_longer(cols = starts_with("ptv"), 
               names_to = "ptv", 
               values_to = "ptv_value") %>%
  group_by(respondent_id) %>%
  mutate(voted_party_ptv = ptv_value[party_to_ptv == ptv]) %>%
  ungroup() %>%
  mutate(electoral_opportunity = ifelse(party_to_ptv != ptv, 
                                        pmin(1, 1 - (sqrt(voted_party_ptv) - sqrt(ptv_value))), 
                                        NA)) %>% 
  select(-c(voted_party_ptv, ptv_value)) %>%
  pivot_wider(names_from = ptv, 
              values_from = electoral_opportunity, 
              names_prefix = "electoral_opportunity_") %>%
  mutate(avg_electoral_opportunity = rowMeans(select(., starts_with("electoral_opportunity")), na.rm = TRUE))

which gives:

vote_choice country party_to_ptv respondent_id electoral_opportunity_ptv_1 electoral_opportunity_ptv_2 electoral_opportunity_pt…¹ avg_e…²
   <chr>       <chr>   <chr>                <int>                       <dbl>                       <dbl>                      <dbl>   <dbl>
 1 PartyA      A       ptv_2                   10                       0.937                      NA                          0.949   0.943
 2 PartyB      A       ptv_3                    4                       0.838                       0.583                     NA       0.710
 3 PartyB      A       ptv_3                    6                       0.647                       0.625                     NA       0.636
 4 PartyB      A       ptv_3                    7                       0.720                       1                         NA       0.860
 5 PartyB      A       ptv_3                    8                       1                           0.346                     NA       0.673
 6 PartyC      A       ptv_1                    1                      NA                           0.610                      0.526   0.568
 7 PartyC      A       ptv_1                    2                      NA                           0.580                      1       0.790
 8 PartyC      A       ptv_1                    3                      NA                           0.531                      0.828   0.679
 9 PartyC      A       ptv_1                    5                      NA                           0.912                      1       0.956
10 PartyC      A       ptv_1                    9                      NA                           0.715                      0.843   0.779
11 PartyD      B       ptv_3                   13                       1                           0.986                     NA       0.993
12 PartyD      B       ptv_3                   16                       0.971                       1                         NA       0.985
13 PartyD      B       ptv_3                   19                       0.827                       0.491                     NA       0.659
14 PartyD      B       ptv_3                   20                       0.997                       0.973                     NA       0.985
15 PartyE      B       ptv_1                   11                      NA                           0.458                      0.140   0.299
16 PartyE      B       ptv_1                   12                      NA                           1                          1       1    
17 PartyE      B       ptv_1                   14                      NA                           0.487                      0.746   0.616
18 PartyF      B       ptv_2                   15                       0.597                      NA                          0.915   0.756
19 PartyF      B       ptv_2                   17                       0.870                      NA                          0.722   0.796
20 PartyF      B       ptv_2                   18                       0.563                      NA                          0.678   0.621

You can omit any columns you don't need

Yes, nice. I was trying something similar and then noticed it's actually simpler without the pivoting (see my aswer). However, I realized that what I have in the columns (`electoral_opportunity`) I actually need to assign as averages to the individual parties in `vote_choice`). Does that make sense? — Dr. Fabian Habersack, Jul 27 '23 at 12:55
So 1 average for 1 vote choice? (in other words doing `summarise()`)? — Sotos, Jul 27 '23 at 12:57
Oh that's so interesting. How did you get different average opportunities on the party-level? Looks good, but now I need to find out where we went into different directions. You can also have a look a my solution below, if you want to. :) — Dr. Fabian Habersack, Jul 27 '23 at 13:20

Dr. Fabian Habersack · Accepted Answer · 2023-07-27T13:23:22.217

Okay, it is apparently more straight forward than I initially thought. Here is what I have done to solve the biggest part of the problem.

library(tidyverse)

df %>% 
  mutate(ptv_v = case_when(party_to_ptv == "ptv_1" ~ ptv_1,
                                    party_to_ptv == "ptv_2" ~ ptv_2,
                                    party_to_ptv == "ptv_3" ~ ptv_3,
                                    T ~ NA_real_)) %>% 
  mutate(electoral_opportunity_1 = ifelse(ptv_1 > ptv_v, 1, 1 - (sqrt(ptv_v) - sqrt(ptv_1)) ) %>% ifelse(party_to_ptv == "ptv_1", NA, .),
         electoral_opportunity_2 = ifelse(ptv_2 > ptv_v, 1, 1 - (sqrt(ptv_v) - sqrt(ptv_2)) ) %>% ifelse(party_to_ptv == "ptv_2", NA, .),
         electoral_opportunity_3 = ifelse(ptv_3 > ptv_v, 1, 1 - (sqrt(ptv_v) - sqrt(ptv_3)) ) %>% ifelse(party_to_ptv == "ptv_3", NA, .) ) -> df

Now I just need to get the average electoral opportunity for each party in vote_choice. It's tricky and I'm still triyng to figure out what I actually want. This is all a bit clumsy but I think it does what I want:

df %>%
  mutate(opportunity = case_when(is.na(electoral_opportunity_1) ~ mean(electoral_opportunity_1, na.rm = T),
                                 is.na(electoral_opportunity_2) ~ mean(electoral_opportunity_2, na.rm = T),
                                 is.na(electoral_opportunity_3) ~ mean(electoral_opportunity_3, na.rm = T),
                                 T ~ NA_real_)) -> df

df %>% 
  group_by(vote_choice, country) %>%
  summarize(opportunity = mean(opportunity, na.rm = T))

  vote_choice country opportunity
  <chr>       <chr>         <dbl>
1 PartyA      A             0.706
2 PartyB      A             0.779
3 PartyC      A             0.830
4 PartyD      B             0.779
5 PartyE      B             0.830
6 PartyF      B             0.706

hah! Did I overkill it in mine? I have been thinking it for a while so maybe I got a bit burned up :) — Sotos, Jul 27 '23 at 12:54
Haha, what you have looks pretty cool, but I guess it's actually simpler than I initially thought. Just need to figure out a way to assign the average electoral opportunities to the parties in vote_choice now. :) — Dr. Fabian Habersack, Jul 27 '23 at 12:59
I updated my answer continuing your train of thought. Have a look and let me know — Sotos, Jul 27 '23 at 13:06

Conditional calculation of new variable based on specific groups of rows, row values and columns in R dataframe

2 Answers2