I am trying to find partial duplicates among comma-separated lists stored in a column of an aggregated dataset in R. For example, my column looks like this:

| fruit |
| :-----: |
| apple,banana,orange |
| melon,pineapple,grapes,kiwi |
| coconut,papaya |
| mango |
| apple,banana,orange,coconut,papaya |
| mango,melon,pineapple,grapes,kiwi |

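For reproducibility, the example column can be built like this. The name `df` is only a placeholder I picked for this post, and I am assuming the lists are stored as plain character strings:

```r
# Example data only; `df` is a placeholder name, not my real dataset
df <- data.frame(
  fruit = c(
    "apple,banana,orange",
    "melon,pineapple,grapes,kiwi",
    "coconut,papaya",
    "mango",
    "apple,banana,orange,coconut,papaya",
    "mango,melon,pineapple,grapes,kiwi"
  ),
  stringsAsFactors = FALSE
)
```
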
I want to create a column that identifies the partial duplicates in the group. For example, the output for the above table should be:
fruit | unique_id |
---|---|
apple,banana,orange | 1 |
melon,pineapple,grapes,kiwi | 2 |
coconut,papaya | 3 |
mango | 4 |
apple,banana,orange,coconut,papaya | 1,3 |
mango,melon,pineapple,grapes,kiwi | 2,4 |
So `unique_id` will hold as many ids as there are shorter lists contained in that row. Is there a way to do this with dplyr? I need the code to find the partial duplicates and assign the ids automatically, so `str_detect` or `grepl` isn't helpful: I would have to supply a pattern to search for, which isn't feasible given the size of my dataset. Any help would be appreciated.
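
To make the rule I have in mind concrete, here is a rough base R sketch. It rests on my own assumptions: a row that contains no other row's full list counts as a "base" row and gets the next id in order, and every other row gets the ids of the base rows it fully contains. The names `items`, `contains`, `is_base` and `base_id` are just ones I made up for illustration. It compares every pair of rows, so I doubt it scales, and exact duplicate rows would not be handled sensibly:

```r
fruit <- c(
  "apple,banana,orange",
  "melon,pineapple,grapes,kiwi",
  "coconut,papaya",
  "mango",
  "apple,banana,orange,coconut,papaya",
  "mango,melon,pineapple,grapes,kiwi"
)

# Split each row into its vector of items
items <- strsplit(fruit, ",", fixed = TRUE)
n <- length(items)

# contains[i, j] is TRUE when row i holds every item of a different row j
contains <- outer(seq_len(n), seq_len(n), Vectorize(function(i, j) {
  i != j && all(items[[j]] %in% items[[i]])
}))

# Assumption: a "base" row contains no other row; base rows are numbered in order
is_base <- !apply(contains, 1, any)
base_id <- cumsum(is_base)

# Base rows keep their own id; other rows collect the ids of the base rows they contain
unique_id <- vapply(seq_len(n), function(i) {
  if (is_base[i]) return(as.character(base_id[i]))
  paste(base_id[is_base & contains[i, ]], collapse = ",")
}, character(1))

data.frame(fruit, unique_id)
```

On the example data this gives the `unique_id` values from the table above; what I am after is a dplyr-friendly way to get the same result on a much larger dataset.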