
I am trying to find partial duplicates in a column of comma-separated values in an aggregated dataset in R. For example, my column looks like this:

| fruit |
| :-----: |
| apple,banana,orange |
| melon,pineapple,grapes,kiwi |
| coconut,papaya |
| mango |
| apple,banana,orange,coconut,papaya |
| mango,melon,pineapple,grapes,kiwi |
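
For reference, here is a reproducible version of the example (assuming the column is a plain character vector; the `tibble` construction is just for illustration):

```r
library(dplyr)

df <- tibble(
  fruit = c(
    "apple,banana,orange",
    "melon,pineapple,grapes,kiwi",
    "coconut,papaya",
    "mango",
    "apple,banana,orange,coconut,papaya",
    "mango,melon,pineapple,grapes,kiwi"
  )
)
```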

I want to create a column that identifies the partial duplicates in the group. For example, the output for the above table should be:

| fruit | unique_id |
| :----- | :----- |
| apple,banana,orange | 1 |
| melon,pineapple,grapes,kiwi | 2 |
| coconut,papaya | 3 |
| mango | 4 |
| apple,banana,orange,coconut,papaya | 1,3 |
| mango,melon,pineapple,grapes,kiwi | 2,4 |

So the unique id will contain as many items as there are partial duplicates of that row. Is there a way to do this with dplyr? I need the code to find the partial duplicates and assign a unique id automatically, so `str_detect` or `grepl` aren't helpful: I would have to supply a search pattern, which isn't feasible given the size of my database. Any help would be appreciated.
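
For what it's worth, here is a rough sketch of the logic I have in mind, building on the reproducible data above (my own assumption: a row counts as a partial duplicate when it fully contains another row's set of fruits); I'd still prefer a cleaner, more scalable dplyr approach:

```r
library(dplyr)
library(purrr)

# Split each row's string into a character vector of fruits
sets <- strsplit(df$fruit, ",")

# TRUE when `a` is a strict subset of `b`
is_strict_subset <- function(a, b) all(a %in% b) && length(a) < length(b)

# A row is a "base" set when no other row is a strict subset of it
base <- map_lgl(seq_along(sets), function(i) {
  !any(map_lgl(sets[-i], is_strict_subset, b = sets[[i]]))
})

# Number the base sets in order of appearance
ids <- cumsum(base)

df <- df %>%
  mutate(unique_id = map_chr(seq_along(sets), function(i) {
    if (base[i]) return(as.character(ids[i]))
    # Other rows inherit the ids of every base set they contain
    contained <- base & map_lgl(sets, function(s) all(s %in% sets[[i]]))
    paste(ids[contained], collapse = ",")
  }))
```

On the example above this reproduces the desired `unique_id` column (`1`, `2`, `3`, `4`, `1,3`, `2,4`), but I suspect the pairwise subset check won't scale to my full data, which is part of why I'm asking.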

  • If `fruit` is a string, I don't think it will be easy to do better than a function like `grep` or the [`chmatch`](https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/chmatch) family. You could also try [`stringi`](https://stackoverflow.com/questions/24257850/fast-partial-string-matching-in-r). – Raisin Apr 20 '22 at 20:35
  • Yes, it is a string. I'm not sure how to use `grep` without giving it a pattern; I need it to find the matches automatically. Perhaps I could make the fruits column into a list of lists and then work from within it to count repeats? – Aila Apr 20 '22 at 22:26
