3

I'm trying to come up with an elegant, rule-based way to assign codes to rows in a data frame based on combinations of values in columns, using this data:

library(tidyr)
df <- crossing(yr2018=c("M","S","W"),
               yr2019=c("M","S","W"),
                yr2020=c("M","S","W")) %>%
  print(n=27)

# A tibble: 27 × 3
   yr2018 yr2019 yr2020
   <chr>  <chr>  <chr> 
 1 M      M      M     
 2 M      M      S     
 3 M      M      W     
 4 M      S      M     
 5 M      S      S     
 6 M      S      W     
 7 M      W      M     
 8 M      W      S     
 9 M      W      W     
10 S      M      M     
11 S      M      S     
12 S      M      W     
13 S      S      M     
14 S      S      S     
15 S      S      W     
16 S      W      M     
17 S      W      S     
18 S      W      W     
19 W      M      M     
20 W      M      S     
21 W      M      W     
22 W      S      M     
23 W      S      S     
24 W      S      W     
25 W      W      M     
26 W      W      S     
27 W      W      W     
>

What I want to end up with is a column with codes applied with rules such the following:

  • if all 3 values in yr2018, yr2019, and yr2020 are the same (MMM, SSS, or WWW), then set the new column value to the concatenation of "CON" and whatever the unique value is, so either "CONM", "CONS", or "CONW".
  • if the first and third columns are the same, but the second is different, then concatenate the two unique values together as exactly "MS","MW", or "SW", in that order, depending on which two unique values are in the row, regardless of the order of the values in the columns.
  • if all three are different, regardless of order, then "MSW"
  • if the last two are the same, but different from the first, then concatenate "CON" with the last value, so either "CONM", "CONS", or "CONW"
  • lastly, if the first two are the same and the last different, then concatenate "CON" with the first column, so either "CONM", "CONS", or "CONW"

This feels like a big, ugly if statement, but I'm hoping for something more elegant, especially since my real data is actually 4x5 (625 rows). It also feels like maybe regular expressions, which I struggle with.

I started looking into row-wise functions and found rowwise() as a start to logically reconfigure the data frame, but it looks like the number of functions that can operate that way are limited.

All guidance welcome!

E Maas
  • 115
  • 7
  • 2
    seems that you only have 2 conditions, if at least two consecutive are the same then replace with CON_value otherwise get unique – Onyambu Jul 19 '23 at 00:14
  • @Onyambu, that's a great observation, though it's actually three conditions, as the last two conditions are two different variations. The list of possible of output values is the same, yes, but the last value appended to "CON" is based on either the first column or the last, depending. – E Maas Jul 19 '23 at 12:41

4 Answers4

4

You can use mutate and case_when to efficiently satisfy these conditions. sort in the second logic will organize the letters as you described.

Since case_when evaluates iteratively, you may be able to parse this down to make it more elegant, but as written it should follow your exact conditions:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(new_column = case_when(
    yr2018 == yr2019 & yr2019 == yr2020 ~ paste0("CON", yr2018),
    yr2018 == yr2020 ~ paste(sort(c(yr2019, yr2020)), collapse = ""),
    yr2018 != yr2019 & yr2019 != yr2020 & yr2018 != yr2020 ~ "MSW",
    yr2019 == yr2020 & yr2018 != yr2020 ~ paste0("CON", yr2020),
    yr2018 == yr2019 & yr2018 != yr2020 ~ paste0("CON", yr2018)
  )) 

Output:

   yr2018 yr2019 yr2020 new_column
   <chr>  <chr>  <chr>  <chr>     
 1 M      M      M      CONM      
 2 M      M      S      CONM      
 3 M      M      W      CONM      
 4 M      S      M      MS        
 5 M      S      S      CONS      
 6 M      S      W      MSW       
 7 M      W      M      MW        
 8 M      W      S      MSW       
 9 M      W      W      CONW      
10 S      M      M      CONM      
11 S      M      S      MS        
12 S      M      W      MSW       
13 S      S      M      CONS      
14 S      S      S      CONS      
15 S      S      W      CONS      
16 S      W      M      MSW       
17 S      W      S      SW        
18 S      W      W      CONW      
19 W      M      M      CONM      
20 W      M      S      MSW       
21 W      M      W      MW        
22 W      S      M      MSW       
23 W      S      S      CONS      
24 W      S      W      SW        
25 W      W      M      CONW      
26 W      W      S      CONW      
27 W      W      W      CONW
jpsmith
  • 11,023
  • 5
  • 15
  • 36
  • 1
    this is a very readable solution and would be easy to maintain. However, the non-"CON_" output should be ONLY "MS", "MW", "SW", or "MSW", with the values in that order in each code, regardless of their order across the columns. – E Maas Jul 19 '23 at 13:30
  • @EMaas - gotcha - see edit where I used `gtools::mixedsort()` to ensure the values are "MS", "MW", "SW", or "MSW" – jpsmith Jul 19 '23 at 13:52
  • It looks like the sorting worked on these first 10 rows, but breaks down in rows 11, 21, and 24 ("SM", "WM", and "WS"), as if the sorting algorithm just flips the order of the values in columns yr2019 and yr2020. – E Maas Jul 19 '23 at 16:01
  • I also cannot get stri_sort to work. This works: ```> stringi::stri_sort(c("S","M")) [1] "M" "S"``` but this doesn't: ```> stringi::stri_sort(paste("S","M")) [1] "S M"``` Getting the former format to run in the case statement is eluding me. ```stringi::stri_sort(c(yr2019, yr2020))``` results in an error "`new_column` must be size 1, not 2." I presume this is because stri_sort is returning a 2-element character vector. If I paste them together with ```paste(stringi::stri_sort(c(yr2019, yr2020)),collapse="")```, I get "MMMMMMMMMMMMMMMMMMSSSSSSSSSSSSSSSSSSWWWWWWWWWWWWWWWWWW" as the value. – E Maas Jul 19 '23 at 16:21
  • @EMaas hmm - try the edit, which sorts before pasting but requires `rowwise()`- it appears to work and does not require additional packages – jpsmith Jul 19 '23 at 16:23
  • 1
    That's it! Thank you very much. – E Maas Jul 19 '23 at 17:33
4

You could use str_replace:

df %>%
 mutate(new_column = str_replace(exec(str_c, !!!.),".*?(.)\\1+.*", "CON\\1")%>%
     str_replace('((.).)\\2', "\\1"))

# A tibble: 27 × 4
   yr2018 yr2019 yr2020 new_column
   <chr>  <chr>  <chr>  <chr>     
 1 M      M      M      CONM      
 2 M      M      S      CONM      
 3 M      M      W      CONM      
 4 M      S      M      MS        
 5 M      S      S      CONS      
 6 M      S      W      MSW       
 7 M      W      M      MW        
 8 M      W      S      MWS       
 9 M      W      W      CONW      
10 S      M      M      CONM    

You could also use gsubfn::gsubfn:

 df %>%
   mutate(newcol = gsubfn::gsubfn(".*(.)\\1+.*|((.).)\\3", 
                     function(x,y,z)if(nzchar(z))y else str_c('CON', x), 
                           exec(str_c, !!!.), backref = -3))
# A tibble: 27 × 4
   yr2018 yr2019 yr2020 newcol
   <chr>  <chr>  <chr>  <chr> 
 1 M      M      M      CONM  
 2 M      M      S      CONM  
 3 M      M      W      CONM  
 4 M      S      M      MS    
 5 M      S      S      CONS  
 6 M      S      W      MSW   
 7 M      W      M      MW    
 8 M      W      S      MWS   
 9 M      W      W      CONW  
10 S      M      M      CONM  
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • 1
    Jesus Christ that's elegant! – jpsmith Jul 19 '23 at 00:35
  • 1
    @jpsmith I would still go with the `case_when`. Was running your code. Wondering why you need the `rowwise` – Onyambu Jul 19 '23 at 00:36
  • 1
    @jpsmith tried reducing it using `gsubfn` – Onyambu Jul 19 '23 at 01:03
  • haven’t heard of the `gsubfn` package - thanks for letting me know! – jpsmith Jul 19 '23 at 02:30
  • @Onyambu, this is close, but other than "CON_", the new values should always be exactly "MS", "MW" "SW", or "MSW", regardless of the order of the values in the columns. So rows 4, 6, 7 are correct, but 8 should be "MSW". I edited my post to make that more clear. – E Maas Jul 19 '23 at 12:56
  • 1
    @EMaas I am quite sure before your edit you clearly stated that the order did not matter, Seems it does. Well before you run the code, just sort the data rowwise. and then run the code. – Onyambu Jul 19 '23 at 13:11
  • Clever. We can use formula notation for function, eliminate backref since it is the default and define an aux function `Paste0` like this: `library(dplyr); library(gsubfn); Paste0 <- function(...) do.call(paste0, data.frame(...)); df %>% mutate(newcol = gsubfn(".*(.)\\1+.*|((.).)\\3", x+y+z ~ if(nzchar(z))y else paste0('CON', x), Paste0(.)))` – G. Grothendieck Jul 19 '23 at 14:08
  • @Onyambu, you're correct, but the "order didn't matter" was meant to refer to the data in the columns. The output was meant to be limited to the list of codes I provided. A quick sort does seem to address the issue. – E Maas Jul 19 '23 at 15:14
2

A way could be to use rle and in case there are consecutive paste this to CON otherwise sort the unique values.

sapply(apply(df, 1, rle, simplify = FALSE), \(x)
       if(is.na(i <- which(x$lengths > 1)[1]))
           paste(sort(unique(x$values)), collapse="")
       else  paste0("CON", x$value[i]) )
# [1] "CONM" "CONM" "CONM" "MS"   "CONS" "MSW"  "MW"   "MSW"  "CONW" "CONM"
#[11] "MS"   "MSW"  "CONS" "CONS" "CONS" "MSW"  "SW"   "CONW" "CONM" "MSW" 
#[21] "MW"   "MSW"  "CONS" "SW"   "CONW" "CONW" "CONW"
GKi
  • 37,245
  • 2
  • 26
  • 48
  • rle is new to me. what a great function! I was thinking an *apply approach could make sense. – E Maas Jul 19 '23 at 13:37
-1

This may be a better solution for larger data frames since we can fine-tune each rule, besides the case_when and str_replace answers already posted. The gather() method can convert the data frame into a longer format with 3 columns, adding a 'year' and 'value'. Then we can use rowwise to apply conditions to each row. The 5 rules are applied with case_when based on our new columns year and value. Then ungroup() data back into its original form.

library(tidyr)
library(dplyr)

df <- crossing(yr2018 = c("M", "S", "W"),
               yr2019 = c("M", "S", "W"),
               yr2020 = c("M", "S", "W"))

df <- df %>%
  gather(year, value) %>%
  rowwise() %>%
  mutate(new_column = case_when(
    # Rule 1: All three values are the same
    all(value == value[1]) ~ paste0("CON", value[1]),

    # Rule 2: First and third columns are the same
    year[1] == year[3] & value[1] != value[2] ~ paste0(value[1], value[2]),
    year[1] == year[3] & value[1] != value[3] ~ paste0(value[1], value[3]),
    year[2] == year[3] & value[2] != value[1] ~ paste0(value[2], value[1]),

    # Rule 3: All three values are different
    all(value != value[1]) ~ "MSW",

    # Rule 4: Last two values are the same
    value[2] == value[3] & value[1] != value[2] ~ paste0("CON", value[2]),
    value[1] == value[3] & value[1] != value[2] ~ paste0("CON", value[3]),

    # Rule 5: First two values are the same
    value[1] == value[2] & value[1] != value[3] ~ paste0("CON", value[1])
  )) %>%
  ungroup() %>%
  select(-year, -value)

print(df)
hygtfrde
  • 130
  • 3
  • I like the "all" function! It's new to me. I think this might suffer from the same issue as most of the other suggestions, though, where the order of the 2-value codes should be fixed as exactly "MS", "MW" or "SW". So for example, if the column data are either "SWS" or "WSW", the code should be "SW". – E Maas Jul 19 '23 at 13:06