
For a set of documents, I have a 30 x 32 feature matrix in which rows represent documents and columns represent features: 30 documents with 32 features each. After running a PSO algorithm, I found some cluster centroids (I am not yet sure whether they are optimal), each of which is a row vector of length 32. I also have a 30 x 1 column vector that records the centroid each document has been assigned to, so element 1 holds the index of the centroid to which document 1 was assigned, and so on. The assignments come from computing the Euclidean distance of each document from the centroids.

I would like some hints on whether there is a way in R to plot this multidimensional data as clusters. Is there a way, for example, to collapse these dimensions to 1-D, or otherwise show them in a graph that is reasonably pretty to look at?

I have been reading about multidimensional scaling. As far as I understand, it reduces multidimensional data to fewer dimensions, which seems to be what I want. So I tried it with this code (centroids[[3]] is a 4 x 32 matrix holding the 4 centroids):

points <- features.dataf[2:ncol(features.dataf)]
row.names(points) <- features.dataf[,1]

fit <- cmdscale(points, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlab="Coordinate 1", ylab="Coordinate 2", main="Clustering Text Based on PSO", type="n")
text(x, y, labels = row.names(points), cex=.7)

It gives me this error:

Error in cmdscale(pointsPlusCentroids, eig = TRUE, k = 2) : 
  distances must be result of 'dist' or a square matrix

However, it still seems to produce a plot. But the pch = 19 point symbols do not appear, only the text labels. Like this: [plot image]

In addition to the above, I want to color the labels so that documents in cluster 1 get one color, those in cluster 2 a different color, and so on. Is there any way to do this if I have a column vector of centroid assignments like this:

     [,1]
 [1,]    1
 [2,]    3
 [3,]    1
 [4,]    4
 [5,]    1
 [6,]    4
 [7,]    3
 [8,]    4
 [9,]    4
[10,]    4
[11,]    2
[12,]    2
[13,]    2
[14,]    2
[15,]    1
[16,]    2
[17,]    1
[18,]    4
[19,]    2
[20,]    4
[21,]    1
[22,]    1
[23,]    1
[24,]    1
[25,]    1
[26,]    3
[27,]    4
[28,]    1
[29,]    4
[30,]    1

Could anyone please help me with this, or suggest any other way to plot multidimensional clusters like these? Thank you!

QPTR

1 Answer

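As cmdscale needs distances, try cmdscale(dist(points), eig = TRUE, k = 2). Symbols do not appear because of type = "n", which tells plot to set up the axes without drawing any points. For coloring text, use: text(x, y, rownames(points), cex = 0.6, col = centroids), where centroids stands for your vector of cluster assignments.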
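To make these steps concrete, here is a minimal sketch that combines them. It is not the asker's actual data: `points` is a random stand-in for the 30 x 32 feature matrix, and `assignment` is a hypothetical name for the vector of cluster indices.

```r
# Synthetic stand-in for the 30 x 32 feature matrix (assumption: the real
# data is numeric, with document names as row names).
set.seed(42)
points <- matrix(rnorm(30 * 32), nrow = 30, ncol = 32,
                 dimnames = list(paste0("doc", 1:30), NULL))

# Hypothetical assignment vector: one cluster index (1 to 4) per document.
assignment <- sample(1:4, 30, replace = TRUE)

# cmdscale() expects distances, so convert the features with dist() first.
fit <- cmdscale(dist(points), eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]

# Draw the symbols colored by cluster (no type = "n"), then add the labels.
plot(x, y, pch = 19, col = assignment,
     xlab = "Coordinate 1", ylab = "Coordinate 2",
     main = "Clustering Text Based on PSO")
text(x, y, labels = rownames(points), cex = 0.7, pos = 3, col = assignment)
legend("topright", legend = paste("Cluster", 1:4), col = 1:4, pch = 19)
```

Dropping type = "n" lets the pch = 19 symbols show, and pos = 3 places each label just above its point so the two do not overlap.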

Robert
  • Could you please explain a bit how col = centroids achieves this? I am looking at what col does, but it's not very obvious how the column vector automatically gets converted to colors. Thank you. – QPTR May 25 '15 at 20:03
  • No mystery. R converts the numbers 1:8 to colors according to `palette()`. See for example, with `n = 5`: `pie(rep(1,n), col=FALSE)`; `pie(rep(1,n), col="red")`; `pie(rep(1,n), col=2)`; `pie(rep(1,n), col=1:2)` (recycled); `pie(rep(1,n), col=1:n)`; `palette()` (the eight basic colors); `pie(rep(1,n), col=c("black","red","green3","blue","cyan"))`; `pie(rep(1,n), col=rainbow(n))`; `pie(rep(1,n), col=terrain.colors(n))`. Then with `n = 15`: `pie(rep(1,n), col=1:n)` (recycled); `pie(rep(1,n), col=terrain.colors(n))`. – Robert May 25 '15 at 22:55
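The lookup described in this comment can also be written out explicitly. The sketch below is illustrative only: `assignment` holds the first six values of the question's vector, and `custom` is a hypothetical set of color names chosen for the example.

```r
# First six cluster assignments from the question, kept as a one-column matrix.
assignment <- matrix(c(1, 3, 1, 4, 1, 4), ncol = 1)

# as.vector() drops the matrix shape so the values act as plain indices.
idx <- as.vector(assignment)

# This mirrors what col = <integer> does: integer i selects palette()[i].
default_cols <- palette()[idx]

# The same indexing works with any color vector of your own choosing.
custom <- c("firebrick", "steelblue", "darkgreen", "orange")
my_cols <- custom[idx]
print(my_cols)
```

Passing col = my_cols to text() or plot() then gives each cluster a chosen color rather than whatever the current palette() holds.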