3

I have a nested data.frame - df_nested, there one of column contains df:

df <- tibble(ID_Value = 1:8,
             xyz001 = c("text4", NA, NA, NA, NA, NA, NA, "text2"), 
             xyz002 = c(NA, NA, NA, "text3", "text1", NA, NA, NA),
             xyz003 = c(NA, "text1", NA, NA, "text2", NA, "text2", NA)) 

I want to find a way, how to mutate this df, on these requirements:

  1. mutate(across(matches("\\d")
  2. there are 4 cases - 4 levels by priority. text4 <- text3 <- text2 <- text1: I need to find and keep column values containing only the highest level text. e.g. if column contains text4, I want to remove text3, text2, text1 and replace them to NA. If it contains multiple highest order text, we should keep all these values (e.g. column xyz003).
  3. how to apply these conditions not specifying column name, because there can be any number in column name.
  4. if column contains all NAs, do nothing.

my attempt:

df_nested <- df_nested %>%
    mutate(df = map(data, ~.x %>%
       mutate(across(matches("\\dd"), function (x) {
                      conditions (ifelse, case_when or other)
                      ...}

Also, should we better use across(), or is vars() still a good way to do it as well? Thank you in advance.

Expected Output

df <- tibble(ID_Value = 1:8,
             xyz001 = c("text4", NA, NA, NA, NA, NA, NA, NA), 
             xyz002 = c(NA, NA, NA, "text3", NA, NA, NA, NA),
             xyz003 = c(NA, NA, NA, NA, "text2", NA, "text2", NA))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213

2 Answers2

1

You can use rowwise with c_across :

library(dplyr)
library(tidyr)

df %>%
  rowwise() %>%
  mutate(col = suppressWarnings(max(c_across(matches('\\d+')), na.rm = TRUE)))

#  ID_Value xyz001 xyz002 xyz003 col  
#     <int> <chr>  <chr>  <chr>  <chr>
#1        1 tier4  NA     NA     tier4
#2        2 NA     NA     tier1  tier1
#3        3 NA     NA     NA     NA   
#4        4 NA     tier3  NA     tier3
#5        5 NA     tier1  tier2  tier2
#6        6 NA     NA     NA     NA   
#7        7 NA     NA     tier2  tier2
#8        8 tier2  NA     NA     tier2

Taking max on character values doesn't make sense (and generates warnings) but here we can use it to get the output directly.


To keep only the max value in each row we can reshape the data :

df %>%
  pivot_longer(cols = -ID_Value) %>%
  group_by(ID_Value) %>%
  mutate(value = replace(value, -which.max(readr::parse_number(value)), NA)) %>%
  pivot_wider()

#  ID_Value xyz001 xyz002 xyz003
#     <int> <chr>  <chr>  <chr> 
#1        1 tier4  NA     NA    
#2        2 NA     NA     tier1 
#3        3 NA     NA     NA    
#4        4 NA     tier3  NA    
#5        5 NA     NA     tier2 
#6        6 NA     NA     NA    
#7        7 NA     NA     tier2 
#8        8 tier2  NA     NA    
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • tier4, tier3 2 and 1 can be any text. And I have added expected output. – Don el Fuego Integer Nov 09 '20 at 10:37
  • It should work as far as the text is the same for all the columns. I have updated the answer to produce output similar to the expected output and also used `parse_number` to deal with only number instead of text. – Ronak Shah Nov 09 '20 at 10:48
  • I think the better approach would be to name somehow all four cases, as I mentioned tierX can be any text. I gave the number just to show priority. And yes, the same 4 conditions (strings) are the same throughout columns. – Don el Fuego Integer Nov 09 '20 at 10:51
  • If number does not denote the priority then what does? How do we know which value to keep and which to turn `NA` ? – Ronak Shah Nov 09 '20 at 10:52
  • lets say tier4 could be = "string4", tier3 = "string3". I just meant, that first we check all values containing string3, and remove the rest. – Don el Fuego Integer Nov 09 '20 at 10:54
  • So there is still 4 and 3 in it, right? Did you try the answer , did you get any errors? – Ronak Shah Nov 09 '20 at 10:55
  • it does not work. there can be cases without number, just a plain text. also in your comment, it did not produced that i wanted. col xyz001 should have only tier4 values. – Don el Fuego Integer Nov 09 '20 at 11:01
1
  1. Use a factor type to specify the order you want.
  2. Do row- or column-wise operations across the matches.

Consider this function

max_only <- function(x, lvls) {
  fct <- droplevels(factor(x, lvls))
  `[<-`(x, as.integer(fct) != length(levels(fct)), NA_character_)
}

You can then specify any order you want

> max_only(c("apple", "banana", NA_character_), c("banana", "apple"))
[1] "apple" NA      NA     
> max_only(c("apple", "banana", NA_character_), c("apple", "banana"))
[1] NA       "banana" NA   

Case 1: column-wise operations

df %>% 
  mutate(across(matches("\\d"), max_only, c("tier1", "tier2", "tier3", "tier4")))

Output (this one looks more like your expected output)

# A tibble: 8 x 4
  ID_Value xyz001 xyz002 xyz003
     <int> <chr>  <chr>  <chr> 
1        1 tier4  NA     NA    
2        2 NA     NA     NA    
3        3 NA     NA     NA    
4        4 NA     tier3  NA    
5        5 NA     NA     tier2 
6        6 NA     NA     NA    
7        7 NA     NA     tier2 
8        8 NA     NA     NA    

Case 2: row-wise operations

df %>% 
  mutate(as.data.frame(t(apply(
    across(matches("\\d")), 1L, 
    max_only, c("tier1", "tier2", "tier3", "tier4")
  ))))

Output

# A tibble: 8 x 4
  ID_Value xyz001 xyz002 xyz003
     <int> <chr>  <chr>  <chr> 
1        1 tier4  NA     NA    
2        2 NA     NA     tier1 
3        3 NA     NA     NA    
4        4 NA     tier3  NA    
5        5 NA     NA     tier2 
6        6 NA     NA     NA    
7        7 NA     NA     tier2 
8        8 tier2  NA     NA    

Explanations

  1. [<- is almost equivalent to x[...] <- y; x. If ... is a logical vector (i.e. TRUE/FALSE), then values in x indexed by TRUE will be replaced by y. For example,

     > x <- c("a", "b" ,"c")
     > `[<-`(x, c(FALSE, TRUE, TRUE), NA_character_)
     [1] "a" NA  NA 
     > x[c(FALSE, TRUE, TRUE)] <- NA_character_; x
     [1] "a" NA  NA 
    
  2. NA_character_ is the NA value of a character type.

  3. as.integer(fct) != length(levels(fct)) returns a logical vector of the same length as fct. TRUE indexes positions where the values of fct are not the highest level, FALSE indexes the opposite, and NA indexes NAs. For example, assume that fct looks like this

     > x <- c("apple", "banana", NA)
     > fct <- droplevels(factor(x, c("apple", "banana", "pear")))
     > fct
     [1] apple  banana <NA>  
     Levels: apple banana
    

    Then, you can see that

     > as.integer(fct) != length(levels(fct))
     [1]  TRUE FALSE    NA 
    
  4. All together, it just means that we assign NA_character_s to values that are not equal to the highest level but keep NAs unchanged.

    [<-(x, as.integer(fct) != length(levels(fct)), NA_character_)
    
ekoam
  • 8,744
  • 1
  • 9
  • 22
  • I think it may be a solution. I will get back after investigation. I wish I knew factor(). Can you explain that is here, specially this combo of symbols `[<-`? `[<-`(x, as.integer(fct) != length(levels(fct)), NA_character_) – Don el Fuego Integer Nov 09 '20 at 11:18