Grouping string variables from a dataframe by best string match to make subsets

Question

I have a data frame with a column of country names. The same country is often written in different ways: for example, there are differences in upper and lower case, some letters are missing, some extra letters appear, and so on.

So I need to group them by similar patterns. For example, I have two observations that belong to the same category, ("Brasil", "brazil"), which I need to put together. I cannot do this by hand because the entire data frame has about 10,000 observations.

After assigning similar observations to one category, I need to make subsets from these categories.

Is there a way to group those similar names into categories and then make subsets by category that keep the other columns of the data frame?

I tried to use the agrep function, with no success.

number <- c(1:6)
country <- c("Brasil","brazil","Costa Rica","costarrica","suiza","Holanda")
example <- data.frame(number,country)

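# A for loop returns NULL, so collect the matches with lapply instead:
# one vector of matching row indices per country name
agrupamiento <- lapply(example$country, function(nm) {
  agrep(nm, example$country,
    max.distance = 0.1, ignore.case = TRUE)
})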

See my approach; if it is what you need, you can accept it. – BENY, Oct 20 '17 at 12:59

score 1 · Accepted Answer · answered Oct 19 '17 at 23:04

This works for your sample data set, using stringdist::phonetic:

library(stringdist)
example$ph <- phonetic(example$country)
example
  number    country   ph
1      1     Brasil B624
2      2     brazil B624
3      3 Costa Rica C236
4      4 costarrica C236
5      5      suiza S200
6      6    Holanda H453
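
On the full ~10,000 rows it may be worth checking which spellings each code collects before trusting the grouping. A minimal sketch in base R, with the codes copied from the output above so it runs without stringdist; the names `spellings` and `merged` are made up for illustration.

```r
# Sample data with the phonetic codes copied from the output above
example <- data.frame(
  number  = 1:6,
  country = c("Brasil", "brazil", "Costa Rica", "costarrica", "suiza", "Holanda"),
  ph      = c("B624", "B624", "C236", "C236", "S200", "H453"),
  stringsAsFactors = FALSE
)

# Distinct spellings collected under each code
spellings <- lapply(split(example$country, example$ph), unique)

# Codes that merge more than one spelling are the ones to review by eye
merged <- spellings[lengths(spellings) > 1]
merged
```

On real data, `merged` would be the place to look for codes that lump together names of different countries.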

Then split the data frame by that code:

out <- split(example, f = example$ph)
out
$B624
  number country   ph
1      1  Brasil B624
2      2  brazil B624

$C236
  number    country   ph
3      3 Costa Rica C236
4      4 costarrica C236

$H453
  number country   ph
6      6 Holanda H453

$S200
  number country   ph
5      5   suiza S200
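
If the phonetic codes turn out too coarse or too strict on the full data (Soundex keeps the first letter, so a misspelt first letter lands in a different group), one alternative is to cluster on edit distance. This is only a sketch with base R, not part of the approach above: the normalisation step, the single-linkage method and the cut-off of 2 edits are all assumptions that would need tuning on the real column.

```r
# Same sample data as in the question
example <- data.frame(
  number  = 1:6,
  country = c("Brasil", "brazil", "Costa Rica", "costarrica", "suiza", "Holanda"),
  stringsAsFactors = FALSE
)

# Normalise: lower case, letters only (this drops the space in "Costa Rica")
key <- gsub("[^a-z]", "", tolower(example$country))

# Pairwise Levenshtein distances between the normalised names
d <- adist(key)

# Single-linkage clustering, cut at an assumed threshold of 2 edits
tree <- hclust(as.dist(d), method = "single")
example$group <- cutree(tree, h = 2)

# One data frame per group, with all other columns kept
subsets <- split(example, example$group)
subsets[["1"]]
```

Each element of `subsets` can then be used like any other data frame, for example `subsets[["1"]]` for the two Brazil rows. Note that the full distance matrix grows with the square of the number of rows, so on 10,000 observations it may be better to run this on the unique names only.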
