
I'm doing some clustering work with the Accord.NET library. Ultimately, I'm trying to find the optimal number of clusters using the elbow method, which only requires some relatively simple calculations. However, I'm having a hard time getting the values I need to determine the best value of K for my KMeans model.

I have some example data/code:

open Accord
open Accord.Math
open Accord.MachineLearning
open Accord.Statistics
open Accord.Statistics.Analysis

let x = [|
    [|4.0; 1.0; 1.0; 2.0|]; 
    [|2.0; 4.0; 1.0; 2.0|]; 
    [|2.0; 3.0; 1.0; 1.0|]; 
    [|3.0; 6.0; 2.0; 1.0|]; 
    [|4.0; 4.0; 1.0; 1.0|]; 
    [|5.0; 10.0; 1.0; 2.0|]; 
    [|7.0; 8.0; 1.0; 2.0|]; 
    [|6.0; 5.0; 1.0; 1.0|]; 
    [|7.0; 7.0; 2.0; 1.0|]; 
    [|5.0; 8.0; 1.0; 1.0|]; 
    [|4.0; 1.0; 1.0; 2.0|]; 
    [|3.0; 5.0; 0.0; 3.0|]; 
    [|1.0; 2.0; 0.0; 0.0|]; 
    [|4.0; 7.0; 1.0; 2.0|]; 
    [|5.0; 3.0; 2.0; 0.0|]; 
    [|4.0; 11.0; 0.0; 3.0|]; 
    [|8.0; 7.0; 2.0; 1.0|]; 
    [|5.0; 6.0; 0.0; 2.0|]; 
    [|8.0; 6.0; 3.0; 0.0|]; 
    [|4.0; 9.0; 0.0; 2.0|] 
    |]

and I can generate the clusters easily enough with

let kmeans = new KMeans 5

let kmeansMod = kmeans.Learn x
let clusters = kmeansMod.Decide x

but how can I calculate the distance from any given data point in x to its assigned cluster's centroid? I don't see anything in the KMeansClusterCollection class documentation that suggests there's already a method implemented for this.

It seems like it should be relatively simple to calculate this distance, but I'm at a loss. Would it be as easy as doing something like

let dataAndClusters = Array.zip clusters x

let getCentroid (m: KMeansClusterCollection) (i: int) = 
    m.Centroids.[i]

dataAndClusters
|> Array.map (fun (c, d) -> (c, (getCentroid kmeansMod c) 
                                |> Array.map2 (-) d
                                |> Array.sum))

which returns

val it : (int * float) [] =
  [|(1, 0.8); (0, -1.5); (1, -0.2); (0, 1.5); (0, -0.5); (4, 0.0); (2, 1.4);
    (2, -3.6); (2, 0.4); (3, 0.75); (1, 0.8); (0, 0.5); (1, -4.2); (3, -0.25);
    (1, 2.8); (4, 0.0); (2, 1.4); (3, -1.25); (2, 0.4); (3, 0.75)|]

Am I calculating this distance correctly? I suspect not.

As I mentioned, I'm looking to determine the right value of K to use for KMeans clustering. I just thought I'd use the simple algorithm laid out in the second paragraph of this Stats.StackExchange.com answer. Please note that I'm not opposed to using the "Gap Statistic" linked at the bottom of the top answer.
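
For context, the calculation I have in mind is roughly the sketch below: fit KMeans for a range of K values, add up the squared distances from every point to its assigned centroid, and look for the K where that total stops dropping sharply. The pointDistance parameter is just a placeholder for the piece I'm missing; it isn't an Accord member.

// Sketch of the elbow-method loop I want to run. The pointDistance parameter
// is a placeholder for the piece I'm missing: the distance from a data point
// to the centroid of the cluster it was assigned to.
let elbowCurve (data: float[][]) (pointDistance: KMeansClusterCollection -> float[] -> float) =
    [ 1 .. 10 ]
    |> List.map (fun k ->
        let model = KMeans(k).Learn data
        let totalWithinSS = data |> Array.sumBy (fun p -> (pointDistance model p) ** 2.0)
        // each (k, totalWithinSS) pair is one point on the elbow plot
        k, totalWithinSS)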

  • You should be able to compute the distance to its nearest cluster using the Scores() method instead of Decide(). – Cesar Jul 08 '17 at 16:48

1 Answer


Turns out that I wasn't calculating distances correctly, but I was close.

Doing some more digging, I found this similar question for the R language and broke down the process outlined in its accepted answer in my own R session.

The steps seem to be pretty straightforward:

1. From each data value, subtract the corresponding centroid value
2. Square each of the differences
3. Sum the squared differences for a given data/centroid pair
4. Take the square root of that sum

In other words, it's just the ordinary Euclidean distance between a point and its assigned centroid.

For my example data above, it would break down to this:

let distances = 
    dataAndClusters
    |> Array.map (fun (c, d) -> (c, (getCentroid kmeansMod c) 
                                    |> Array.map2 (-) d
                                    |> Array.map (fun diff -> diff ** 2.0)
                                    |> Array.sum
                                    |> sqrt))

Note the addition of two lines,

|> Array.map (fun diff -> diff ** 2.0), which squares each of the differences (x ** y raises x to the power y),

and

|> sqrt, which takes the square root of the sum of the squared differences.

There may be a built-in method for doing this, but I haven't found it yet. For now, this works for me.
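
To tie this back to the elbow method from my question: the within-cluster sum of squares for this model is just the sum of these squared distances, so a single point on the elbow curve can be had with something like the sketch below (using the distances value above; repeating the fit and this sum for a range of K values gives the full curve).

// One point on the elbow curve for the current model (K = 5): square each
// point's distance to its assigned centroid and add them all up.
let totalWithinSS =
    distances |> Array.sumBy (fun (_, dist) -> dist ** 2.0)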
