1

I have a data frame with IDs and string values, of which some I prefer over others:

library(dplyr)

d1<-data.frame(id=c("a", "a", "b", "b"), 
           value=c("good", "better", "good", "good"))

I wand to handle that equivalent to the following example with numbers:

d2<-data.frame(id=c("a", "a", "b", "b"), 
           value=c(1, 2, 1, 1))

d2 %>% group_by(id) %>%
       summarize(max(value))

So if an ID has multiple values, I will always get the highest number for each ID:

# A tibble: 2 x 2
  id    `max(value)`
  <fct>        <dbl>
1 a                2
2 b                1

Equivalent, if an ID has multiple strings, I always want to extract the preferred string for the d1 dataframe: If we have "good", use that row, if another row has "better" use that row instead, thus eliminating duplicated IDs.

The example is arbitrary, could also be >>if we have "yes" and "unknown" then take "yes", else take "unknown"<<

So is there an "extract best string" function for the dplyr::summarize() function?

The result should look like this:

id | value
----------
"a"| "better"
"b"| "good"
aldorado
  • 4,394
  • 10
  • 35
  • 46
  • 1
    You might find this helpful https://stackoverflow.com/questions/67179036/remove-a-column-with-bad-quality-data-if-there-is-good-quality-data-for-the-same – Elle Apr 20 '21 at 14:30

2 Answers2

2

you can try a factor approach. First you need an ordered vector of your strings like:

my_levels <- c("better", "good")

Then you change the levels accordingly, transform to numeric, summarize and transform back.

d1 %>% 
  mutate(value_num = factor(value, levels = my_levels) %>% as.numeric) %>% 
  group_by(id) %>%
  summarize(res = min(value_num)) %>% 
  mutate(res_fac = factor(res, labels = my_levels))
# A tibble: 2 x 3
id      res res_fac
<chr> <dbl> <fct>  
1 a         1 better 
2 b         2 good
Roman
  • 17,008
  • 3
  • 36
  • 49
1

Similar to @roman s answer, but using the data.table package you could do the following to filter the "better" rows:

require(data.table)
setDT(d1)

# convert value to factor
d1[ , value := factor(value, levels = c('better', 'good'))]
# return first ordered value by each id group
d1[ , .SD[order(value)][1], id]
andschar
  • 3,504
  • 2
  • 27
  • 35