
I'd like to create a grouping variable based on how similar a selection of names is. I have started by using the stringdist package to generate a measure of distance, but I'm not sure how to use that output to generate a group-by variable. I've looked at hclust, but it seems that to use clustering functions you need to know how many groups you want in the end, and I do not know that. The code I started with is below:

library(stringdist)

name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")

name_dist <- stringdistmatrix(name_list)
name_dist
name_dist2 <- stringdistmatrix(name_list, method="soundex")
name_dist2

I would like to see a dataframe with two columns that look like

name = c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")

name_group = c(1, 1, 1, 2, 2, 2, 3, 3, 4)

The groups might differ slightly depending on which distance measure I use (I've suggested two above), but I would probably choose one or the other to run.

Basically, how do I get from the distance matrix to a group variable without knowing the number of clusters I'd like?

Steven Beaupré
Kath05
  • This question is probably too broad, but [this](https://en.wikipedia.org/wiki/Soundex) might give you some ideas to get started with. – joran Aug 27 '15 at 21:00
  • ...indeed some simple Googling leads one to the **stringdist** package, which might be helpful. – joran Aug 27 '15 at 21:02
  • Indeed -- I must have pasted in only part of my code. stringdistmatrix is a function in the stringdist package that generates distances among entries. I was having trouble clustering by the distance after that, but I think Huck below has provided a great example I can work with. – Kath05 Aug 28 '15 at 14:57

2 Answers


You can also use adist(...) in base R to calculate the Levenshtein distances, and cluster based on that.

n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)
rownames(d) <- n
cl <- hclust(as.dist(d))
plot(cl)
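
To get from this dendrogram to a grouping variable without choosing the number of clusters, one option is to cut the tree at a distance threshold rather than at a cluster count. A minimal sketch, assuming a threshold of 2 edits (the name `max_edits` and its value are illustrative, not part of the answer above):

```r
n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)                    # Levenshtein distances (base R, utils package)
rownames(d) <- n
cl <- hclust(as.dist(d))         # complete linkage is the hclust default

max_edits <- 2                   # assumed threshold; tune it for your data
name_group <- cutree(cl, h = max_edits)   # cut by height (h), not by count (k)

result <- data.frame(name = n, name_group = unname(name_group))
print(result)
```

With complete linkage, every pair of names inside a group is at most `max_edits` edits apart. For these nine names the cut yields four groups: the three Mary variants, the three Joe variants, Bob/Beb, and Paul.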

jlhoward
  • Thanks. For what I'm doing, I like the ability to choose my distance methods which stringdist allows. Depending on the list of names, I may group first by phonetic distance (soundex) and then by something else. This is because I think some of the names I'm working with will be transliterated from a different language and alphabet. – Kath05 Aug 28 '15 at 15:00

You could use a cluster analysis like this:

# load the package (library() stops with an error if it is not installed)
library(stringdist);

# group selection: by number of classes (k) or by cut height (h)
num.class <- 5;
num.height <- 0.5;

# define names 
n <- c("Mary", "Mery", "Mari", "Joe", 
       "Jo", "Joey", "Bob", "Beb", "Paul");

# calculate distances
d <- stringdistmatrix(n, method="soundex");

# cluster the stuff
h <- hclust(d);

# cut the cluster by num classes
m <- cutree(h, k = num.class);

# cut the cluster by height
p <- cutree(h, h = num.height);

# build the resulting frame
df <- data.frame(names = n, 
                 group.class = m, 
                 group.prob = p);

It produces:

df;
   names group.class group.prob
1  Mary         1          1
2  Mery         1          1
3  Mari         1          1
4   Joe         2          2
5    Jo         2          2
6  Joey         2          2
7   Bob         3          3
8   Beb         4          3
9  Paul         5          4
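
The height-based cut (`group.prob`) is the part that works without a known number of clusters, so it can be wrapped into a small reusable function. This is only a sketch: `group_names`, `max_dist` and the thresholds below are hypothetical names and values chosen for illustration, not part of stringdist.

```r
library(stringdist)

# Hypothetical helper: cluster names and cut the tree at a distance
# threshold, so the number of groups falls out of the data.
group_names <- function(x, method = "soundex", max_dist = 0.5) {
  d <- stringdistmatrix(x, method = method)   # returns a "dist" object
  h <- hclust(d)                              # complete linkage by default
  data.frame(name = x, name_group = unname(cutree(h, h = max_dist)))
}

n <- c("Mary", "Mery", "Mari", "Joe",
       "Jo", "Joey", "Bob", "Beb", "Paul")

# soundex distances are 0 or 1, so any cut height between them works
by_sound <- group_names(n)

# Levenshtein: keep names in one group if they are within 2 edits
by_edits <- group_names(n, method = "lv", max_dist = 2)
```

Here `by_sound$name_group` matches the `group.prob` column shown above. For other distance methods the threshold has to be chosen on the scale of that method.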

And the chart gives you an overview:

plot(h, labels=n);


Regards huck

huckfinn
  • There's no need for semicolons at the end of R statements (unless they're on the same line, of course). – jlhoward Aug 28 '15 at 05:37
  • Thanks, Huck, for showing me how to implement the cluster analysis for a project like this one. It seems like that will work with a little playing around! – Kath05 Aug 28 '15 at 14:56
  • @jlhoward Sorry for the semicolon routine. I'm an old limping man who was drowned in Perl and Pascal code in the '90s ;-) – huckfinn Aug 28 '15 at 15:24
  • @huckfinn @Kath05: If you use the soundex method, the distance computed by `stringdist` is 0 if the two words translate to the same soundex code, and 1 otherwise. So you can get your name groups directly, using `name_group <- phonetic(name)`. (Of course this can't be applied to the other methods such as Levenshtein.) – Scarabee Apr 17 '17 at 22:56
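
A minimal sketch of the soundex shortcut from the last comment, assuming integer group ids are wanted rather than the raw codes (the `match()` step and the `codes` variable are additions for illustration):

```r
library(stringdist)

name <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")

codes <- phonetic(name, method = "soundex")   # e.g. "M600" for "Mary"

# number the codes in order of first appearance: 1, 2, 3, ...
name_group <- match(codes, unique(codes))

data.frame(name, code = codes, name_group)
```

No clustering step is needed here, because names with the same soundex code are by definition at distance 0 from each other.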