
I am trying to implement Kernel K Means clustering with the kkmeans() function from the kernlab R package. My problem is that my code returns the expected output when I specify some numbers of clusters with the function's centers argument, but throws an error for other numbers of clusters:

Error in if (sum(abs(dc)) < 1e-15) break : missing value where TRUE/FALSE needed

My guess is that this is a convergence issue, since the error seems to arise when I increase the number of clusters, but that would be surprising because I have many more rows than the number of clusters I'm specifying. For example, I can successfully specify 10 clusters with an 8000x3 matrix, but I receive an error with 100 clusters. Similarly, with a 50-row subset of that data, I can specify 5 clusters but not 10.

Below is a minimal reproducible example that replicates both the success and the error.

Error if centers = 10

kernlab::kkmeans(mymat, centers=10)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed

No error if centers = 5

kernlab::kkmeans(mymat, centers=5)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Spectral Clustering object of class "specc" 
#> 
#>  Cluster memberships: 
#>  
#> 1 1 1 1 2 1 1 3 3 5 5 5 3 2 2 2 4 4 3 3 5 2 2 5 5 5 5 5 5 2 4 3 3 3 2 2 5 3 3 5 5 4 4 4 3 1 4 2 5 3 
#>  
#> Gaussian Radial Basis kernel function. 
#>  Hyperparameter : sigma =  0.756590498067127 
#> 
#> Centers:  
#>          [,1]      [,2]     [,3]
#> [1,] 15.75871 -16.69486 191.5841
#> [2,] 16.74850 -21.94730 186.8914
#> [3,] 15.99483 -18.95892 190.2622
#> [4,] 15.45729 -18.13571 191.9611
#> [5,] 16.69136 -22.19600 187.0055
#> 
#> Cluster size:  
#> [1]  7 10 12  7 14
#> 
#> Within-cluster sum of squares:  
#> [1] 301006.7 443237.8 607889.4 305777.1 685823.5

Example data (50x3 matrix)

mymat <- structure(c(15.9390001296997, 15.9079999923706, 16.087999343872, 
15.7930002212524, 15.9619998931884, 15.6129999160766, 15.7550001144409, 
16.7740001678466, 16.9080009460449, 17.0769996643066, 16.3640003204345, 
16.5960006713867, 16.579999923706, 16.4570007324218, 16.2320003509521, 
16.1639995574951, 15.6180000305175, 15.5109996795654, 15.5120000839233, 
15.628999710083, 16.9950008392333, 17.3530006408691, 17.2229995727539, 
16.8910007476806, 17.1800003051757, 17.1709995269775, 16.9860000610351, 
16.704999923706, 16.273000717163, 15.8830003738403, 15.6230001449584, 
15.333999633789, 15.3839998245239, 15.3870000839233, 17.1119995117187, 
17.6200008392333, 16.8349990844726, 16.4969997406005, 16.2479991912841, 
16.1259994506835, 15.8059997558593, 15.378999710083, 15.4320001602172, 
15.2100000381469, 15.2519998550415, 15.2150001525878, 15.4280004501342, 
17.4790000915527, 16.6739997863769, 16.4330005645751, -16.6299991607666, 
-16.9529991149902, -17.5610008239746, -17.8290004730224, -18.6200008392333, 
-17.1079998016357, -16.25, -21.716999053955, -21.1219997406005, 
-21.8209991455078, -20.1840000152587, -20.0450000762939, -20.9599990844726, 
-19.5240001678466, -18.6590003967285, -19.4379997253417, -18.6280002593994, 
-18.0669994354248, -16.204999923706, -15.5830001831054, -23.9489994049072, 
-23.57200050354, -24.3969993591308, -23.2880001068115, -22.6019992828369, 
-23.2329998016357, -22.5979995727539, -22.6140003204345, -20.8059997558593, 
-19.4300003051757, -19.4729995727539, -17.5690002441406, -16.8110008239746, 
-15.2930002212524, -25.2509994506835, -24.7649993896484, -24.8080005645751, 
-21.9939994812011, -21.5189990997314, -20.329999923706, -20.25, 
-19.1380004882812, -18.6180000305175, -18.5900001525878, -16.1620006561279, 
-14.5329999923706, -14.4359998703002, -25.8169994354248, -24.2159996032714, 
-22.57200050354, 190.996994018554, 190.996002197265, 190.18699645996, 
191.039993286132, 190.205993652343, 191.919006347656, 191.766006469726, 
187.14599609375, 186.889007568359, 186.225997924804, 188.60400390625, 
187.932006835937, 187.837005615234, 188.453002929687, 189.382995605468, 
189.360000610351, 191.25, 191.845001220703, 192.580001831054, 
192.414993286132, 185.358001708984, 184.570999145507, 184.595993041992, 
186.091995239257, 185.613998413085, 185.25, 186.235000610351, 
187.003005981445, 188.744995117187, 190.169998168945, 190.921005249023, 
192.628997802734, 192.768005371093, 193.281997680664, 184.602996826171, 
183.796005249023, 185.414001464843, 187.811004638671, 188.615005493164, 
189.263000488281, 190.167007446289, 191.781997680664, 191.837997436523, 
192.582000732421, 193.399002075195, 194.184005737304, 193.509994506835, 
183.776000976562, 186.173995971679, 187.774993896484), dim = c(50L, 
3L), dimnames = list(NULL, c("x", "y", "z")))
  • If you run `na.omit()` on your matrix, how many rows are left? – Gregor Thomas Aug 23 '22 at 18:17
  • @GregorThomas There are no missing values. This seems to be a convergence issue, if I reduce the number of centers, I can get an answer. I'm updating original post now. – myfatson Aug 23 '22 at 18:34
  • @myfatson Please post an example data object that reproduces the error (i.e. `roi_cluster_training`). Have you also seen https://stackoverflow.com/questions/36027510/error-in-if-anyco-missing-value-where-true-false-needed – socialscientist Aug 23 '22 at 19:53
  • @socialscientist my dataframe is in the original post on pastebin, or are you asking for something else? I did see that other post but don't have any factors, just 3 columns of numerical data. – myfatson Aug 23 '22 at 20:06
  • Yes, please include the data within the post if possible -- avoids dead links, viruses, etc. – socialscientist Aug 23 '22 at 20:44
  • Cleaned up your question to be more legible and *minimal.* – socialscientist Aug 26 '22 at 01:22

1 Answer


This appears to be an issue with something randomly-generated internally by the function during your kkmeans() call. I don't have an answer for "why" this is happening and you'll likely have to check with the authors to determine if it's a bug or intended behavior.

While I reproduced your error with your data and code (running a fresh instance of R each time), the exact same function call sometimes produces other errors and sometimes produces no error at all. However, whether it errors is entirely reproducible once you set.seed(), suggesting it has something to do with starting values that determine other parameters of the model.

Below I show (a) that this can produce an alternative error (I actually saw a third, but didn't save the seed to reproduce it), (b) that even when it does "converge," it produces pretty different clusters purely on the basis of the random seed, and (c) that the hyperparameter tuning is heavily influenced by the random seed. I also managed to get some clustering results with 10 clusters, but forgot to save that seed.

I don't have an answer for why this happens: my hunch is that the automatically-generated settings are nonsensical/out of bounds in some cases and this is producing an error. This may be because your data are in some way strange or may be because the algorithm for setting the hyperparameter(s) doesn't make much sense. It could also be a bug, so perhaps worth posting as an issue.

In any case, a question to ask yourself is whether you want to use a method that errors this inconsistently, produces substantially different results across random seeds, and gives you no way to know whether the algorithm is doing what it claims even when it does succeed.
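You can check mechanically which seeds error and which converge by wrapping the call in tryCatch(). This is just a diagnostic sketch, assuming `mymat` from the question is loaded; the seed range 1:20 is an arbitrary choice:

```r
# Diagnostic sketch: which seeds let kkmeans() run, and which error out?
# Assumes `mymat` from the question is in the workspace.
outcomes <- sapply(1:20, function(s) {
  set.seed(s)
  res <- tryCatch(kernlab::kkmeans(mymat, centers = 10),
                  error = function(e) conditionMessage(e))
  if (is.character(res)) res else "converged"
})
table(outcomes)  # tally of error messages vs. successful runs
```

Because the failure is deterministic given the seed, any seed this reports as erroring can then be replayed on its own to study that particular failure.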

Example 1: clusters=5, no error, set.seed(123)

set.seed(123)
#>  Hyperparameter : sigma =  0.463522505156128 
#> 
#> Centers:  
#>          [,1]      [,2]     [,3]
#> [1,] 16.53045 -21.18700 187.8918
#> [2,] 17.16138 -24.59687 184.7860
#> [3,] 15.73436 -17.87491 191.2586
#> [4,] 15.63425 -16.63862 192.0088
#> [5,] 16.19467 -20.16442 189.1617
#> 
#> Cluster size:  
#> [1] 11  8 11  8 12
#> 
#> Within-cluster sum of squares:  
#> [1] 537972.8 386310.2 544994.1 391965.9 604386.9

Example 2: clusters=5, no error, set.seed(3)

Works, but pretty different numbers of observations per cluster! Note the different hyperparameter.

#>  Hyperparameter : sigma =  0.290281708176631 
#> 
#> Centers:  
#>          [,1]      [,2]     [,3]
#> [1,] 15.97636 -18.38464 190.5449
#> [2,] 16.24809 -20.10409 188.9572
#> [3,] 15.63660 -17.85633 191.5151
#> [4,] 17.06100 -22.70840 185.8834
#> [5,] 17.16138 -24.59687 184.7860
#> 
#> Cluster size:  
#> [1] 11 11 15  5  8
#> 
#> Within-cluster sum of squares:  
#> [1] 545547.7 538434.5 757947.0 236986.8 386310.2

Example 3: clusters=5, no error, set.seed(999)

Works, but pretty different numbers of observations per cluster! Note the different hyperparameter again!


#> Gaussian Radial Basis kernel function. 
#>  Hyperparameter : sigma =  0.128189488632645 
#> 
#> Centers:  
#>          [,1]      [,2]     [,3]
#> [1,] 16.93157 -22.25171 186.4579
#> [2,] 15.45090 -15.99500 192.8452
#> [3,] 15.73677 -18.32277 191.0152
#> [4,] 17.16244 -24.44533 184.8376
#> [5,] 16.32218 -20.69291 188.5965
#> 
#> Cluster size:  
#> [1]  7 10 13  9 11
#> 
#> Within-cluster sum of squares:  
#> [1] 294630.1 457490.3 604486.8 441669.5 539478.6

Example 4: clusters = 10, new error, set.seed(99)

New error.

#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'affinMult' for signature '"rbfkernel", "numeric"'

Example 5: clusters = 10, original error, set.seed(3)

Original error.

#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed

Not shown: an additional error with clusters = 10 (complaining about not finding all of the columns in the matrix), and a run that successfully returned some clusters with clusters = 10.

  • K means clustering it is. Thanks for looking into this @socialscientist. If you happen to know the handle of the kernlab author I'll reach out; the email listed on the package website is dead, I think he's moved to another institution since then. – myfatson Aug 26 '22 at 05:01
  • You should be able to implement your own kernel k-means clustering pretty easily if you know some basic linear algebra and the k-means algorithm... https://medium.com/udemy-engineering/understanding-k-means-clustering-and-kernel-methods-afad4eec3c11 – socialscientist Aug 28 '22 at 01:12
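The hand-rolled kernel k-means the last comment alludes to can be sketched in a few lines of base R. This is an illustration under assumed choices (RBF kernel with sigma = 0.1, random initialization, no restarts), not kernlab's actual implementation. Notably, an empty cluster produces exactly the kind of NaN distance that could trip a convergence check like the one in the error message:

```r
# Hand-rolled kernel k-means -- an illustrative sketch, NOT kernlab's
# implementation. The RBF sigma and the seed are arbitrary choices.

rbf_kernel <- function(X, sigma = 0.1) {
  # K[i, j] = exp(-sigma * ||x_i - x_j||^2), the same form as kernlab's rbfdot
  exp(-sigma * as.matrix(dist(X))^2)
}

kernel_kmeans <- function(K, k, max_iter = 100, seed = 1) {
  set.seed(seed)
  n  <- nrow(K)
  cl <- sample(rep_len(1:k, n))  # random initial assignment, every cluster non-empty
  for (iter in 1:max_iter) {
    # Squared feature-space distance from each point to each cluster centroid:
    # K_ii - (2/|c|) * sum_{j in c} K_ij + (1/|c|^2) * sum_{j,l in c} K_jl
    d <- sapply(1:k, function(c) {
      idx <- which(cl == c)
      nc  <- length(idx)
      if (nc == 0) return(rep(Inf, n))  # empty cluster: without this guard we
                                        # would divide by zero and produce NaNs,
                                        # the same kind of NA/NaN failure the
                                        # kkmeans() error message suggests
      diag(K) - 2 * rowSums(K[, idx, drop = FALSE]) / nc + sum(K[idx, idx]) / nc^2
    })
    new_cl <- max.col(-d)         # argmin distance per row
    if (all(new_cl == cl)) break  # assignments stable: converged
    cl <- new_cl
  }
  cl
}

# Usage on two well-separated Gaussian blobs:
set.seed(42)
X  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 5), ncol = 2))
cl <- kernel_kmeans(rbf_kernel(X), k = 2)
table(cl)
```

Rolling your own also makes the failure mode inspectable: you control the kernel hyperparameter and the initialization directly, rather than depending on kkmeans()'s internal random choices.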