In R, trying to calculate Levenshtein distance of strings in a column then cluster and label by another column

Question

Here is a truncated version of my data set. There are many more rows in the full set. I know I can convert the second column to a vector via as.vector(df[,2]), which I can then use for distance calculation. Once I have the distances, I'm going to cluster. But then I want to know whether the rows that corresponded to "1" in the first column ended up clustering together, and likewise for "2", "3", and so on. How would I go about that?

G5W · Answer 1 · 2016-12-17T01:27:48.390

It would be more helpful to include a text dump of your data using dput(), rather than an image of your data. It looks like your data might be in Excel. You could save it as a csv file and load it into R using read.csv with stringsAsFactors=FALSE. Then your SecondaryStructure column would be strings. Once you have that, load the stringdist package (install it if you don't have it). That package has a function called stringdistmatrix that will give you a distance matrix; pass method="lv" to get Levenshtein distance, since the default is method="osa". Most of the clustering algorithms will take a distance matrix as input. You might start out with hclust (and maybe better to not use the default method="complete" but instead use method="single"). hclust will give you a tree structure. You will have to use cutree to turn that into a set of cluster assignments. When you have the cluster assignments just use table(Clusters, PrimarySeqGroup) to get a confusion matrix.
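For what it's worth, here is a minimal sketch of those steps end to end. The data frame is a made-up stand-in (only the column names SecondaryStructure and PrimarySeqGroup come from the question's data), and k = 2 is an assumption chosen to suit the toy data, not a recommendation for the real set. If stringdist is not installed, the sketch falls back to base R's adist, which also computes Levenshtein distance.

```r
# Toy data: a hypothetical stand-in for the real data set (values are made up).
# With real data you would read the csv via read.csv(..., stringsAsFactors = FALSE).
df <- data.frame(
  PrimarySeqGroup    = c(1, 1, 1, 2, 2, 2),
  SecondaryStructure = c("HHHHCCEE", "HHHHCCEEE", "HHHCCEE",
                         "CCEEEECC", "CEEEECC", "CCEEEECCC"),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances as a "dist" object.
# method = "lv" is needed because stringdist's default is "osa".
if (requireNamespace("stringdist", quietly = TRUE)) {
  d <- stringdist::stringdistmatrix(df$SecondaryStructure, method = "lv")
} else {
  # Base R fallback: adist() returns a full matrix, so convert it.
  d <- as.dist(utils::adist(df$SecondaryStructure))
}

# Hierarchical clustering with single linkage, as suggested above.
hc <- hclust(d, method = "single")

# Cut the tree into k clusters (k = 2 is an assumption for this toy data).
Clusters <- cutree(hc, k = 2)

# Cross-tabulate cluster assignments against the known groups.
confusion <- table(Clusters, PrimarySeqGroup = df$PrimarySeqGroup)
print(confusion)
```

With real data you would likely look at plot(hc) first to pick a sensible k (or a height h) for cutree, rather than fixing it in advance.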

I hope that this helps.

"You will have to use cut to turn that into a set of cluster assignments. When you have the cluster assignments just use table(Clusters, PrimarySeqGroup) to get a confusion matrix." That's what I was looking for. Thanks! The screenshot was actually from R. I made sure to import it properly :) — Eric Brenner, Dec 14 '16 at 21:13
