2

I have a table containing two rows for each ID.

table <- tibble(
  id = c(1,1,2,2,3,3,4,4,5,5),
  row1 = c(2,5,2,5,1,3,2,5,3,2),
  row2 = c("foo", "other foo", "bar", "bar", "bar", "bar other", "other", "foo", "other", "other")
)
> table
# A tibble: 10 × 3
      id  row1 row2 
   <dbl> <dbl> <chr>
 1     1     2 foo  
 2     1     5 other foo
 3     2     2 bar  
 4     2     5 bar  
 5     3     1 bar
 6     3     3 bar other
 7     4     2 other
 8     4     4 foo  
 9     5     3 other
10     5     2 other

I would like to resolve the table to a single row for each ID based on three rules in succession:

  1. If, for each ID, there is one row in which row1 is 5 or more, then choose the row where row1 is less than 5.
  2. Else if, for each ID, there is a row in which row2 contains the word 'other', choose the row where row2 does not contain the word 'other'
  3. Otherwise, for each ID, pick the first row.

I feel there must be a more straightforward way of doing this. This is my attempt so far, but I've can't work out how to resolve the NA to return 'bar'.

table %>%
  group_by(id) %>%
  summarise(
    row1 = ifelse(max(row1) >= 5,
            first(row1[row1 < 5]),
            ifelse(
              grep("other", row2),
              ifelse(
                !is.na(first(row1[grep("other", row2, invert = T)])),
                first(row1[grep("other", row2, invert = T)]),
                first(row1)),
              first(row1))
    ),
    row2 = ifelse(
      max(row1) >= 5,
      first(row2[row1 < 5]),
      ifelse(
        grep("other", row2),
        ifelse(
          !is.na(first(row2[grep("other", row2, invert = T)])),
          first(row2[grep("other", row2, invert = T)]),
          first(row2)),
        first(row2)
    )
)
)

# A tibble: 5 × 3
     id  row1 row2 
  <dbl> <dbl> <chr>
1     1     2 foo  
2     2     2 NA   
3     3     1 bar  
4     4     2 foo  
5     5     3 other

Desired output:

id row1 row2
1 2 foo
2 2 bar
3 1 bar
4 2 other
5 3 other

Many thanks for your help.

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51

4 Answers4

3

Here is how we can do it:

library(dplyr)
library(tidyr)
library(stringr)

table %>%
  group_by(id) %>% 
  separate_rows(row2) %>%
  mutate(x = ifelse(row1>=5, min(row1),NA),
         y = ifelse(str_detect(row2, 'other'), !str_detect(row2, 'other'), NA)) %>% 
  slice(1) %>% 
  select(-c(x, y))
    id  row1 row2 
  <dbl> <dbl> <chr>
1     1     2 foo  
2     2     2 bar  
3     3     1 bar  
4     4     2 other
5     5     3 other
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • I find something confusing. Could you try `table$row2[5:6] <- c("bar other", "other")`? What do you think which row should be returned for `id` 3 with OP's rule? – Darren Tsai May 07 '22 at 19:18
  • 1
    I think none of them because all contain `other` therefore I thought to separate?! – TarJae May 07 '22 at 19:26
  • 1
    But I think `"bar other"` should be returned, because this case doesn't meet the first and second rules, and hence should go to the third one, i.e. for each `id`, pick the first row. I'm still not sure for this. Good work! – Darren Tsai May 07 '22 at 19:35
  • Thank you all - this answer is the one that meets my requirements (and with the correct output for ID=3). They've all taught me much along the way! – Jamie Scuffell May 09 '22 at 07:12
2

Here is one solution, that leverages this small function, f() using tidyverse or data.table

f <- function(r1,r2) {
  if(sum(r1>=5)==1) return(list("row1" =r1[r1<5], "row2"=r2[r1<5]))
  if(sum(grepl("other",r2))==1) return(list("row1" = r1[!grepl("other",r2)], "row2"=r2[!grepl("other",r2)]))
  list("row1"=r1[1],"row2"=r2[1])
}

Usage

library(tidyverse)

table %>% 
  group_by(id) %>%
  summarize(n=list(f(row1,row2))) %>%
  unnest_wider(n)

or

library(data.table)

setDT(table)[, f(row1,row2), by=id]

Output:

      id  row1   row2
   <num> <num> <char>
1:     1     2    foo
2:     2     2    bar
3:     3     1    bar
4:     4     2  other
5:     5     3  other
langtang
  • 22,248
  • 1
  • 12
  • 27
  • The original data was changed, so I'm updating the output, not the solution, which I believe remains correct. – langtang May 07 '22 at 19:48
2
table %>%
  group_by(id) %>%
  subset(
    case_when(
      any(row1 >= 5) ~ row1 < 5,
      any(grepl("other", row2)) ~ !grepl("other", row2),
      T ~ T
    )
  ) %>%
  filter(row_number() == 1) %>%
  ungroup()

This answer takes advantage of dplyr's grouping abilities to check for any() within each group, so it gets easy to know if a certain condition happens within a group.

It also uses case_when() to check for a series of conditions in a prioritized order, implementing what would be a series of if/else's.

Finally, since in whatever case we would like only the first row that matches the criteria, it uses the function row_number() to check whether we're on the first row within the group, in order to select it.

Output is:

# A tibble: 5 x 3
     id  row1 row2     
  <dbl> <dbl> <chr>    
1     1     2 foo      
2     2     2 bar      
3     3     1 bar other
4     4     2 other    
5     5     3 other    
> 
user2332849
  • 1,421
  • 1
  • 9
  • 12
  • Thank you for this one - I've put this as the most useful answer as it was a more generalised solution to the problem using dplyr's functionality only. Thanks to all the responses. – Jamie Scuffell May 12 '22 at 09:50
1

A dplyr solution:

table %>%
  group_by(id) %>%
  filter(row1 < 5 | n_distinct(row1 < 5) == 1) %>%
  filter(!grepl("other", row2) | n_distinct(grepl("other", row2)) == 1) %>%
  slice(1) %>% ungroup()

# # A tibble: 5 × 3
#      id  row1 row2 
#   <dbl> <dbl> <chr>
# 1     1     2 foo  
# 2     2     2 bar  
# 3     3     1 bar  
# 4     4     2 other
# 5     5     3 other

n_distinct(...) == 1 is used to determine if a condition is all TRUE or all FALSE.

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
  • `row2` third row `bar other` does not match the desired output: Maybe `separate_rows`? – TarJae May 07 '22 at 18:54
  • 1
    @TarJae The problem results from the OP's typo of code to create the data. You can check the OP's data. – Darren Tsai May 07 '22 at 19:02
  • 1
    Looks like the data were changed. I updated my output, but not my solution, to reflect a change in the OP's data – langtang May 07 '22 at 19:50