I'm trying to cluster a matrix of words by their semantic correlation with the OPTICS algorithm.

I have a matrix like this:

Table

I want to see each row as a vector (~260 dimensions) and cluster the terms that are closest to each other.

My code so far:

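library(dbscan)
library(readxl)

# choose.files() is called twice: once for the values, once for the term names
words <- read_excel(choose.files())
term_names <- unlist(read_excel(choose.files())[1])

# Reshape the values into a 266 x 266 matrix, filled row by row
Matrix <- matrix(as.double(words$Column2), nrow = 266, ncol = 266, byrow = TRUE)
colnames(Matrix) <- term_names
rownames(Matrix) <- term_names

### run OPTICS
# minPts must be a single integer; "minPts = 0,4" is parsed as minPts = 0
# plus a stray extra argument 4. The value 4 used here is only a guess.
res <- optics(Matrix, eps = 10, minPts = 4)
res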

Questions

  • How do I show the rownames when it comes to clustering?
  • How do I set the number of clusters in the first place?
belzebubele

1 Answer

OPTICS does not have a fixed number of clusters. It's not k-means.

Instead, it is data driven: you choose clusters based on the valleys in the plot, which correspond to dense areas. If there is just one dense area, then everything may be just one cluster. Some data just does not have multiple clusters.
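
To illustrate reading clusters off the reachability plot, here is a minimal sketch using the `dbscan` package. The two-blob toy data, the seed, and the thresholds `eps_cl = 1` and `xi = 0.05` are assumptions chosen for the example, not values taken from the question.

```r
library(dbscan)

# Toy data (an assumption for illustration): two dense blobs in 2-D
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)

# Compute the OPTICS ordering; no number of clusters is given
res <- optics(x, eps = 10, minPts = 5)

# Reachability plot: valleys are dense regions, peaks separate them
plot(res)

# Cut the plot at a reachability threshold (DBSCAN-style extraction)
res_cut <- extractDBSCAN(res, eps_cl = 1)
table(res_cut$cluster)  # label 0, if present, marks noise

# Or let the steepness of the valleys decide (xi method)
res_xi <- extractXi(res, xi = 0.05)
```

How many clusters come out depends on where the plot is cut, so the threshold is best picked after looking at the plot rather than fixed in advance.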

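As your input data appears to be a similarity matrix, I do not think treating every row as a feature vector is the right approach: you would be measuring distances between similarity profiles rather than between the terms themselves, which introduces bias. Instead, convert the similarities to distances (for example dist = 1 - sim) and pass that as a precomputed distance matrix.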
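
A small sketch of that conversion, which also shows one way to get the term names back next to the cluster labels. The six terms, the similarity values, and the thresholds are hypothetical, and the sketch assumes similarities scaled to [0, 1] so that `1 - sim` is a valid distance.

```r
library(dbscan)

# Hypothetical 6 x 6 similarity matrix: two groups of three related terms
terms <- c("cat", "dog", "horse", "car", "bus", "train")
sim <- matrix(0.1, nrow = 6, ncol = 6, dimnames = list(terms, terms))
sim[1:3, 1:3] <- 0.9
sim[4:6, 4:6] <- 0.9
diag(sim) <- 1

# Turn similarities into distances; as.dist() marks them as precomputed
d <- as.dist(1 - sim)

# optics() accepts a dist object directly
res <- optics(d, eps = 1, minPts = 2)
res <- extractDBSCAN(res, eps_cl = 0.5)

# Attach the term names to the cluster labels and list the terms per cluster
clusters <- setNames(res$cluster, terms)
split(names(clusters), clusters)

# Terms in OPTICS order, i.e. the left-to-right order of the reachability plot
terms[res$order]
```

For a matrix like the one in the question, the analogous call would presumably be `as.dist(1 - Matrix)`, with `rownames(Matrix)` in place of `terms`.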

Has QUIT--Anony-Mousse
  • Why do you think I need to use a distance matrix instead of a similarity matrix? I worked out how to cluster my data using OPTICS and it seems to have worked fine. – belzebubele Apr 17 '18 at 17:19
  • You then compute the distance of the similarity vectors. That will 'work' but has very odd semantics. If you read the manual, it expects data vectors, or you need to set `search="dist"` and pass a *distance* matrix. **Read the documentation**. – Has QUIT--Anony-Mousse Apr 17 '18 at 17:58