R: Replacing Strings with their Most Common Variant

Question

I'm looking to standardise a set of manually inputted strings, so that:

index   fruit
1   Apple Pie
2   Apple Pie.
3   Apple. Pie
4   Apple Pie
5   Pear

should look like:

index   fruit
1   Apple Pie
2   Apple Pie
3   Apple Pie
4   Apple Pie
5   Pear

For my use case, grouping them by phonetic sound is fine, but I can't work out how to replace the less common variants in each group with the most common one.

library(tidyverse)  
library(stringdist)

index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)

akrun · Answer 1 · 2019-06-18T14:31:14.723

We can use str_remove to remove the period. It removes only the first match in each string, which is enough for this data.

library(dplyr)
library(stringr)
data.frame(index, fruit) %>% 
    mutate(fruit = str_remove(fruit, "\\."))
# index     fruit
#1     1 Apple Pie
#2     2 Apple Pie
#3     3 Apple Pie
#4     4 Apple Pie
#5     5      Pear
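
If an entry could contain more than one period, str_remove_all may be the safer choice. A minimal sketch of the difference, using made-up inputs that are not in the question's data:

```r
library(stringr)

# Hypothetical inputs (not from the question): the second string has two periods
x <- c("Apple. Pie", "Apple. Pie.")

# str_remove() drops only the first match in each string
first_only <- str_remove(x, "\\.")
first_only
# "Apple Pie"  "Apple Pie."

# str_remove_all() drops every match
all_removed <- str_remove_all(x, "\\.")
all_removed
# "Apple Pie" "Apple Pie"
```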

If we need to group with phonetic and take the most frequent value in each group, we can define a Mode helper:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}


data.frame(index, fruit) %>%
   mutate(grouping = phonetic(fruit)) %>%
   group_by(grouping) %>% 
   mutate(fruit = Mode(fruit))
# A tibble: 5 x 3
# Groups:   grouping [2]
#  index fruit     grouping
#  <dbl> <fct>     <chr>   
#1     1 Apple Pie A141    
#2     2 Apple Pie A141    
#3     3 Apple Pie A141    
#4     4 Apple Pie A141    
#5     5 Pear      P600
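
One thing worth knowing about this Mode helper is how it behaves on ties and on non-character input, compared with the common names(which.max(table(x))) idiom. A small base R sketch, with made-up vectors chosen only to show the difference:

```r
# Same helper as above: among the most frequent values,
# it returns the one that appears first in x
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical tied input: "b" and "a" both occur twice
tied <- c("b", "a", "b", "a")

mode_result  <- Mode(tied)                     # "b": first value seen wins the tie
table_result <- names(which.max(table(tied)))  # "a": table() sorts its names

# Mode() keeps the input type; names() always returns character
num_mode  <- Mode(c(3, 1, 3))                      # numeric 3
num_table <- names(which.max(table(c(3, 1, 3))))   # character "3"
```

Which tie-breaking rule is preferable depends on the data; neither is more correct in general.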

Is there any flexibility benefit to creating the `Mode` function? — rsylatian, Jun 18 '19 at 14:43

BENY · Accepted Answer · 2019-06-18T14:33:53.340

Sounds like you need to group_by the grouping column, then pick the most frequent item (the mode) in each group:

df %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit))))

# A tibble: 5 x 3
# Groups:   grouping [2]
  index     fruit grouping
  <dbl>    <fctr>    <chr>
1     1 Apple Pie     A141
2     2 Apple Pie     A141
3     3 Apple Pie     A141
4     4 Apple Pie     A141
5     5      Pear     P600
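
To end up with only the index and fruit columns, as in the question's desired output, the pipeline can be finished with ungroup() and select(). A sketch, assuming the question's data and the default Soundex method of stringdist::phonetic; stringsAsFactors = FALSE is my addition so that fruit stays a character column:

```r
library(dplyr)
library(stringdist)

# Data from the question; stringsAsFactors = FALSE keeps fruit as character
df <- data.frame(
  index = 1:5,
  fruit = c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(grouping = phonetic(fruit)) %>%              # Soundex code per string
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit)))) %>%  # most frequent variant per group
  ungroup() %>%                                       # drop the grouping
  select(index, fruit)                                # back to the original columns

result
```

Calling ungroup() before returning the result avoids later verbs silently operating per group.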

score 1 · Answer 3 · answered Jun 18 '19 at 14:41

Another way could be:

fruit %>%
 enframe() %>%
 mutate(grouping = phonetic(value)) %>%
 add_count(value, grouping) %>%
 group_by(grouping) %>%
 mutate(value = value[match(max(n), n)]) %>%
 select(-n) %>%
 ungroup()

   name value     grouping
  <int> <chr>     <chr>   
1     1 Apple Pie A141    
2     2 Apple Pie A141    
3     3 Apple Pie A141    
4     4 Apple Pie A141    
5     5 Pear      P600
