4

I have a data frame with 2 variables and I want to use the clusGap function to find the number of clusters that would be optimal to use. This code has a similar result:

library(cluster)    
x <- as.vector(runif(100, 0, 1))
y <- as.vector(runif(100, 0, 1))
df <- data.frame(x, y)
gap_stat <- clusGap(df, FUN = kmeans, nstart = n,
                    K.max = 10, B = 50)
gap_stat

Result:

Clustering Gap statistic ["clusGap"] from call:
clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = n)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 1
          logW   E.logW           gap     SE.sim
 [1,] 2.569315 2.584217  0.0149021144 0.03210076
 [2,] 2.285049 2.284537 -0.0005116382 0.03231529
 [3,] 2.053193 2.033653 -0.0195399122 0.03282376
 [4,] 1.839085 1.835590 -0.0034952935 0.03443303
 [5,] 1.691219 1.708479  0.0172603348 0.03419994
 [6,] 1.585084 1.597277  0.0121935992 0.03440672
 [7,] 1.504763 1.496853 -0.0079104306 0.03422321
 [8,] 1.416176 1.405903 -0.0102731340 0.03371149
 [9,] 1.333721 1.323658 -0.0100626869 0.03245958
[10,] 1.253199 1.250366 -0.0028330498 0.03034140

As you can see in line 4, the optimal number of clusters is 1. I would like the function to have 1 as an output. I need the optimal number of outputs to be an object in the environment, such as n is 1.

Marco Pastor Mayo
  • 803
  • 11
  • 25

2 Answers2

5

Typically such information is somewhere directly inside the object, like gap_stat$nc. To look for it str(gap_stat) would typically suffice.

In this case, however, the above strategy isn't enough. But the fact that you can see your number of interest in the output, means that print.clusGap (because the class of gap_stat is clusGap) will show how to obtain this number. So, inspecting cluster:::print.clusGap leads to

maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
1

This may have been less transparent in the past, but you can actually specify the method directly:

nc <- maxSE(f         = gap_stat$Tab[,"gap"],
            SE.f      = gap_stat$Tab[,"SE.sim"],
            method    = "firstSEmax",
            SE.factor = 1)