duplicated() in combination with mutate() and lapply() on lists of characters: duplicate identification

Question

I want to identify duplicated character values within grouped lists.

Consider the following example data frame:

ID <- c("Carl", "Carl", "Carl", "Peter", "Peter", "Peter")
Question <- c("need", "need", "need", "dyadic", "dyadic", "dyadic")
V1 <- c("A1", "A2", "C0", "A3", "A3", "A1")
df <- data.frame(ID, Question, V1)

I am using the following code to list the V1 values per group:

df |>
    summarize(present_codes = list(V1), .by = c(ID, Question))

I would like the output to include a new column, 'duplicated_codes', identifying the duplicated values within each grouped list, as below:

ID	Question	present_codes	duplicated_codes
Carl	need	c("A1", "A2", "C0")	character(0)
Peter	dyadic	c("A3", "A3", "A1")	c("A3")

I am trying to use a combination of mutate(), lapply() and x[duplicated(x)], but I get the error message '"FUN" is missing' when running the line below, even though x[duplicated(x)] works on single vectors. I am very new to the tidyverse and to lapply(), so I am probably just making a simple mistake. The actual dataset has more than 40,000 rows.

 |>
    mutate(duplicated_codes=lapply(x=present_codes,x[duplicated(x)]))

Thanks a lot in advance!

jpsmith · Answer 1 · 2023-06-25T17:09:03.053

3

One option would be to first identify duplicates:

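library(dplyr)  # .by requires dplyr >= 1.1.0

# Keep only the rows whose V1 value repeats within its ID/Question group
dupes <- df %>%
  filter(duplicated(V1), .by = c(ID, Question)) %>%
  rename(duplicated_codes = V1)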

Then, building on your existing code, add a dplyr::left_join() call:

df %>%
  summarize(present_codes = list(V1), .by = c(ID, Question)) %>%
  left_join(dupes, by = c("ID", "Question"))

Output:

     ID Question present_codes duplicated_codes
1  Carl     need    A1, A2, C0             <NA>
2 Peter   dyadic    A3, A3, A1               A3

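Or all in one go, per the comment from @Ben. This keeps one row per group and returns character(0) for groups without duplicates, as in your desired output: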

df |> summarise(present_codes = list(V1), 
                duplicated_codes = list(V1[duplicated(V1)]), 
                .by = c(ID, Question))
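
As a side note on the original mutate()/lapply() attempt: lapply()'s arguments are named X and FUN, and R argument names are case-sensitive, so x = present_codes does not match X. The bare expression x[duplicated(x)] is then taken positionally as X, which leaves FUN unsupplied. A minimal sketch of one way to repair it, assuming dplyr 1.1.0 or later (for .by) and the df from the question, is to pass the list positionally and wrap the subsetting in a function:

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

# Example data from the question
df <- data.frame(
  ID = c("Carl", "Carl", "Carl", "Peter", "Peter", "Peter"),
  Question = c("need", "need", "need", "dyadic", "dyadic", "dyadic"),
  V1 = c("A1", "A2", "C0", "A3", "A3", "A1")
)

result <- df |>
  summarize(present_codes = list(V1), .by = c(ID, Question)) |>
  # lapply() applies the function to each element of the list column;
  # x[duplicated(x)] keeps the second and later occurrences of each value
  mutate(duplicated_codes = lapply(present_codes, function(x) x[duplicated(x)]))

result
```

Because duplicated() flags every repeat after the first occurrence, a value that appears three times in a group shows up twice in duplicated_codes; wrapping the result in unique() would collapse that if only distinct values are wanted.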

edited Jun 25 '23 at 17:09

answered Jun 23 '23 at 17:19

jpsmith


Could you consider simplifying in a single `summarise` statement, such as: `df |> summarise(present_codes = list(V1), duplicated_codes = list(V1[duplicated(V1)]), .by = c(ID, Question))`? – Ben Jun 25 '23 at 01:25
@Ben Thanks! I included your suggestion in the edit. – jpsmith Jun 25 '23 at 17:09
