Here is a truncated version of my data set. There are many more rows in the full set. I know I can convert the second column to a vector via as.vector(df[,2]), which I can then use for distance calculation. Once I have the distances, I'm going to cluster. But then I want to know whether the rows that correspond to "1" in the first column ended up clustering together, and likewise for "2", "3", and so on. How would I go about that?

CinCout

1 Answer

It would be more helpful to include a text dump of your data using dput(), rather than an image of your data. It looks like your data might be in Excel. You could save it as a csv file and load it into R using read.csv with stringsAsFactors=FALSE. Then your SecondaryStructure column would be strings.

Once you have that, load the stringdist package (install it if you don't have it). That package has a function called stringdistmatrix that will give you a distance matrix; pass method="lv" to get Levenshtein distance, since the default is method="osa" (optimal string alignment). Most of the clustering algorithms will take a distance matrix as input. You might start out with hclust (and it may be better not to use the default method="complete" but to use method="single" instead). hclust will give you a tree structure. You will have to use cutree to turn that into a set of cluster assignments. When you have the cluster assignments, just use table(Clusters, PrimarySeqGroup) to get a confusion matrix.
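As a rough illustration of the pipeline described above, here is a minimal sketch on made-up data. The values in df are hypothetical (only the column names come from this answer), k = 2 is chosen purely because the toy data has two groups, and base R's adist() is swapped in for the stringdist package so that the example runs without installing anything; adist() also computes Levenshtein distance.

```r
# Hypothetical toy data: column names follow the answer, values are invented.
df <- data.frame(
  PrimarySeqGroup    = c(1, 1, 1, 2, 2, 2),
  SecondaryStructure = c("HHHHEEEE", "HHHHEEEC", "HHHHEEEE",
                         "CCCCCCHH", "CCCCCCHC", "CCCCCCHH"),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances between the strings.
# adist() ships with base R (utils); with the stringdist package installed,
# stringdist::stringdistmatrix(df$SecondaryStructure, method = "lv")
# could be used here instead.
d <- as.dist(adist(df$SecondaryStructure))

# Hierarchical clustering with single linkage.
hc <- hclust(d, method = "single")

# Cut the tree into k clusters (k = 2 is an assumption for this toy data).
Clusters <- cutree(hc, k = 2)

# Cross-tabulate cluster assignments against the known groups.
confusion <- table(Clusters, PrimarySeqGroup = df$PrimarySeqGroup)
print(confusion)
```

If the clusters line up with the primary sequence groups, each row of the table has its counts concentrated in a single column. On real data you would need to pick k (or a cut height h) yourself, for example by inspecting plot(hc).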

I hope that this helps.

G5W

"You will have to use cut to turn that into a set of cluster assignments. When you have the cluster assignments just use table(Clusters, PrimarySeqGroup) to get a confusion matrix." That's what I was looking for. Thanks! The screenshot was actually from R. I made sure to import it properly :) – Eric Brenner Dec 14 '16 at 21:13