
I ran a cluster analysis on a large dataset (1) and found four clusters, which I plotted. Now I have 30 new data points (2) that I want to plot on top of the existing clusters, in order to see which of the original cluster centroids (from the large dataset) each new data point is closest to.
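Closeness to the centroids could also be checked numerically rather than by eye. The following is only a sketch on simulated data: `old`, `new`, `km`, `d` and `nearest` are made-up names, and it assumes the centroids come from a k-means fit on the old data alone, which is not what my combined-data code does.

```r
set.seed(1)

# Simulated stand-ins (assumption): 200 "old" rows and 5 "new" rows, 3 variables
old <- matrix(rnorm(200 * 3), ncol = 3)
new <- matrix(rnorm(5 * 3), ncol = 3)

# Centroids estimated from the old data only
km <- kmeans(old, centers = 4)

# Euclidean distance from every new point (rows) to every centroid (columns)
d <- sapply(seq_len(nrow(km$centers)), function(k) {
  sqrt(rowSums(sweep(new, 2, km$centers[k, ])^2))
})

# Index of the nearest centroid for each new point
nearest <- apply(d, 1, which.min)
nearest
```

Each entry of `nearest` is the cluster number whose centroid is closest to the corresponding new row.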

What I did so far:

#I have combined both data sets (1. my old big data set) and (2. my 30 new data points) and added an indicator variable in order to distinguish between the old and new data sets:
# I only chose variables that are needed for the cluster calculations as well as the indicator
combined.ind <-  combined [, c(1752:1757, 1759:1762, 1942)]

#I created a factor variable that indicates "new" and "old" observations
combined.ind$indicator <- factor(combined.ind$indicator, 
                                 levels = c(0,1),
                                 labels = c("new", "old"))

#Then I ran a hierarchical cluster analysis (Ward) and used its cluster centroids as starting points for a k-means clustering:
#calculate the Ward centroids:

combined.ward.cent <- aggregate(cbind(Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8, Z9, Z10)~CLU4_1,combined,mean)
combined.ward.cent2 <-  combined.ward.cent[, c(2:11)]

#apply kmeans with ward centroids as initial starting points:
kmeans <- kmeans(combined.ind[1:(length(combined.ind)-1)], centers = combined.ward.cent2)


#Then I have plotted the results and tried to highlight the new data points:
#Plot the results
fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind)-1)])

#I changed the colors with scale_color_manual() in order to see the new data points.
fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind)-1)], geom=c("point", "text"), ellipse = TRUE) + geom_point(aes(color=combined.ind$indicator)) + ggtitle("My Beautiful Graph") +
  scale_color_manual("Old vs New", values = c("new" = "black", "old" = "red"))

Since the first dataset is huge, I cannot read the row names of the new data points because all the labels overlap. When I add repel = TRUE to the call (see below), only the row names of the data points at the edge of the plot are shown. That does not help me, because I want to show the row names of the new data points only.

fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind)-1)], geom=c("point", "text"), repel = TRUE, ellipse = TRUE) +
  geom_point(aes(color=combined.ind$indicator)) + ggtitle("My Beautiful Graph") +
  scale_color_manual("Old vs New", values = c("new" = "black", "old" = "red"))
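What I am after looks roughly like the sketch below, which I have only tried on toy data. It does not use fviz_cluster: it builds the plot directly with ggplot2 on the first two principal components and passes only the new rows to ggrepel::geom_text_repel. The names `toy`, `vars`, `km`, `plot.df` and `new.df` are invented for the illustration, and the PCA projection is my assumption about what would be comparable to the fviz_cluster display.

```r
library(ggplot2)
library(ggrepel)

set.seed(1)

# Toy stand-in for the real data (assumption): 300 "old" rows, 30 "new" rows
toy <- as.data.frame(matrix(rnorm(330 * 4), ncol = 4))
toy$indicator <- factor(rep(c("old", "new"), times = c(300, 30)))

vars <- toy[, 1:4]
km <- kmeans(vars, centers = 4)

# Project the standardized variables onto the first two principal components
pcs <- prcomp(scale(vars))$x[, 1:2]
plot.df <- data.frame(pcs,
                      cluster = factor(km$cluster),
                      indicator = toy$indicator,
                      name = rownames(toy))

# Only the new rows are handed to the label layer
new.df <- plot.df[plot.df$indicator == "new", ]

p <- ggplot(plot.df, aes(PC1, PC2)) +
  geom_point(aes(color = cluster), alpha = 0.4) +        # all points, by cluster
  geom_point(data = new.df, color = "black") +            # new points on top
  geom_text_repel(data = new.df, aes(label = name),       # labels for new rows only
                  max.overlaps = Inf)
p
```

Because the label layer gets its own `data` argument, the old points are drawn but never labelled, so the repel step only has 30 labels to place.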

How can I solve this problem?
