
I am trying to find partial duplicates in a column of comma-separated values in an aggregated dataset in R. For example, my column looks like this:

| fruit |
| :-----: |
| apple,banana,orange |
| melon,pineapple,grapes,kiwi |
| coconut,papaya |
| mango |
| apple,banana,orange,coconut,papaya |
| mango,melon,pineapple,grapes,kiwi |
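
For reference, here is a reproducible version of the example (assuming the column is a plain character vector; the `tibble` construction is just for illustration):

```r
library(dplyr)

df <- tibble(
  fruit = c(
    "apple,banana,orange",
    "melon,pineapple,grapes,kiwi",
    "coconut,papaya",
    "mango",
    "apple,banana,orange,coconut,papaya",
    "mango,melon,pineapple,grapes,kiwi"
  )
)
```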

I want to create a column that identifies the partial duplicates in the group. For example, the output for the above table should be:

| fruit | unique_id |
| :----- | :----- |
| apple,banana,orange | 1 |
| melon,pineapple,grapes,kiwi | 2 |
| coconut,papaya | 3 |
| mango | 4 |
| apple,banana,orange,coconut,papaya | 1,3 |
| mango,melon,pineapple,grapes,kiwi | 2,4 |

So the unique id will contain as many items as there are partial duplicates of that row. Is there a way to do this with dplyr? I need the code to find the partial duplicates and assign a unique id automatically, so `str_detect` or `grepl` aren't helpful: I would have to supply a search pattern, which isn't feasible given the size of my database. Any help would be appreciated.
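
For what it's worth, here is a rough sketch of the logic I have in mind, building on the reproducible data above (my own assumption: a row counts as a partial duplicate when it fully contains another row's set of fruits); I'd still prefer a cleaner, more scalable dplyr approach:

```r
library(dplyr)
library(purrr)

# Split each row's string into a character vector of fruits
sets <- strsplit(df$fruit, ",")

# TRUE when `a` is a strict subset of `b`
is_strict_subset <- function(a, b) all(a %in% b) && length(a) < length(b)

# A row is a "base" set when no other row is a strict subset of it
base <- map_lgl(seq_along(sets), function(i) {
  !any(map_lgl(sets[-i], is_strict_subset, b = sets[[i]]))
})

# Number the base sets in order of appearance
ids <- cumsum(base)

df <- df %>%
  mutate(unique_id = map_chr(seq_along(sets), function(i) {
    if (base[i]) return(as.character(ids[i]))
    # Other rows inherit the ids of every base set they contain
    contained <- base & map_lgl(sets, function(s) all(s %in% sets[[i]]))
    paste(ids[contained], collapse = ",")
  }))
```

On the example above this reproduces the desired `unique_id` column (`1`, `2`, `3`, `4`, `1,3`, `2,4`), but I suspect the pairwise subset check won't scale to my full data, which is part of why I'm asking.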

  • If `fruit` is a string, I don't think it will be easy to do better than a function like `grep` or the [`chmatch`](https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/chmatch) family. You could also try [`stringi`](https://stackoverflow.com/questions/24257850/fast-partial-string-matching-in-r). – Raisin Apr 20 '22 at 20:35
  • Yes, it is a string. I'm not sure how to use `grep` without giving it a pattern; I need it to find the matches automatically. Perhaps I could make the fruits column into a list of lists and then work from within it to count repeats? – Aila Apr 20 '22 at 22:26
