I'm looking to standardise a set of manually inputted strings, so that:

index   fruit
1   Apple Pie
2   Apple Pie.
3   Apple. Pie
4   Apple Pie
5   Pear

should look like:

index   fruit
1   Apple Pie
2   Apple Pie
3   Apple Pie
4   Apple Pie
5   Pear

For my use case, grouping them by phonetic sound is fine, but I'm missing the piece on how to replace the least common strings with the most common ones.

library(tidyverse)  
library(stringdist)

index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)
rsylatian

3 Answers

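We can use str_remove with the escaped pattern "\\." to drop the literal period. Note that str_remove only removes the first match in each string, which is enough for this data; str_remove_all would handle strings containing several periods.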

library(dplyr)
library(stringr)
data.frame(index, fruit) %>% 
    mutate(fruit = str_remove(fruit, "\\."))
#  index     fruit
#1     1 Apple Pie
#2     2 Apple Pie
#3     3 Apple Pie
#4     4 Apple Pie
#5     5      Pear
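
If the real data contains more than one stray character per string, a slightly broader clean-up may be needed. The following is only a sketch under that assumption: the `messy` vector is made up for illustration, and stripping all punctuation may be too aggressive if some entries legitimately contain it.

```r
library(stringr)

# Hypothetical inputs with several stray punctuation marks and extra spaces
messy <- c("Apple. Pie.", "Apple  Pie", " Pear!")

# Remove every punctuation character, then trim and collapse whitespace
clean <- str_squish(str_remove_all(messy, "[[:punct:]]"))
clean
```

Here str_remove_all drops every match rather than only the first, and str_squish tidies the spacing left behind.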

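If we need to group with phonetic (from stringdist) and replace each value with the most frequent one in its group:

library(stringdist)

# Return the most frequent value of x (the first one seen wins on ties)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

data.frame(index, fruit) %>%
   mutate(grouping = phonetic(fruit)) %>%  # Soundex code for each string
   group_by(grouping) %>% 
   mutate(fruit = Mode(fruit))             # overwrite with the group's mode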
# A tibble: 5 x 3
# Groups:   grouping [2]
#  index fruit     grouping
#  <dbl> <fct>     <chr>   
#1     1 Apple Pie A141    
#2     2 Apple Pie A141    
#3     3 Apple Pie A141    
#4     4 Apple Pie A141    
#5     5 Pear      P600    
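
One thing worth checking is what happens when two spellings are equally common within a group. The sketch below uses a made-up `tied` vector to show that Mode keeps the spelling that appears first, whereas a table()-based lookup works on sorted names and can pick the other one.

```r
# Same helper as above, repeated so the example is self-contained
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical tie: both spellings occur twice
tied <- c("Apple Pie.", "Apple Pie", "Apple Pie.", "Apple Pie")

# Mode() returns the first spelling encountered among the tied ones
mode_pick <- Mode(tied)

# table() sorts its names, so the tie is broken by sort order instead
table_pick <- names(which.max(table(tied)))

mode_pick
table_pick
```

If ties matter for your data, it may be safer to decide on an explicit rule (for example, prefer the spelling without punctuation) rather than rely on either default.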
akrun

Sounds like you need to group_by the grouping column, then pick the most frequent item (the mode) within each group:

df %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit))))

# A tibble: 5 x 3
# Groups:   grouping [2]
  index     fruit grouping
  <dbl>    <fctr>    <chr>
1     1 Apple Pie     A141
2     2 Apple Pie     A141
3     3 Apple Pie     A141
4     4 Apple Pie     A141
5     5      Pear     P600
BENY

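Another way, starting from the bare vector, could be:

fruit %>%
 enframe() %>%                               # columns: name, value
 mutate(grouping = phonetic(value)) %>%
 add_count(value, grouping) %>%              # n = occurrences of each spelling
 group_by(grouping) %>%
 mutate(value = value[match(max(n), n)]) %>% # most frequent spelling per group
 select(-n) %>%
 ungroup()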

   name value     grouping
  <int> <chr>     <chr>   
1     1 Apple Pie A141    
2     2 Apple Pie A141    
3     3 Apple Pie A141    
4     4 Apple Pie A141    
5     5 Pear      P600 
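
The same group-then-replace idea can also be wrapped in a small vector-in, vector-out helper, which may be convenient inside a mutate() call. This is a minimal sketch, not a definitive implementation: `standardise_by_sound` is a hypothetical name, and it assumes Soundex grouping (the default of phonetic) is an acceptable notion of "same string" for your data.

```r
library(stringdist)

# Hypothetical helper: replace each string with the most common
# spelling among the strings that share its Soundex code
standardise_by_sound <- function(x) {
  grouping <- phonetic(x)
  most_common <- function(v) names(which.max(table(v)))
  # ave() applies most_common within each group and recycles the result
  ave(x, grouping, FUN = most_common)
}

fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")
result <- standardise_by_sound(fruit)
result
```

With the question's data, this could be used as mutate(fruit = standardise_by_sound(fruit)).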
tmfmnk