I've built a cluster model in R (kmeans):

fit <- kmeans(sales_DP_DCT_agg_tr_bel_mod, 4)

Now I want to use this model to segment a brand new data set. How can I:

  1. store the model
  2. run the model on a new data set?
TayTay

1 Answer

Let's say you're using iris as a dataset.

data = iris[,1:4] ## Don't want the categorical feature
model = kmeans(data, 3)

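Here's what the output looks like (kmeans starts from randomly chosen centers, so the cluster numbering and sizes can differ from run to run):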

> model
K-means clustering with 3 clusters of sizes 96, 33, 21

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.314583    2.895833     4.973958   1.7031250
2     5.175758    3.624242     1.472727   0.2727273
3     4.738095    2.904762     1.790476   0.3523810

Clustering vector:
  [1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 118.651875   6.432121  17.669524
 (between_SS / total_SS =  79.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

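Notice that you have access to the centroids through model$centers. To classify an incoming sample, all you have to do is find the centroid it is closest to. The sample must have the same columns, in the same order and on the same scale, as the data the model was fit on. You could define a Euclidean distance function as follows: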

eucDist <- function(x, y) sqrt(sum( (x-y)^2 ))

And then a classifying function:

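classifyNewSample <- function(newData, centroids = model$centers) {
  ## Distance from the new sample to each centroid (one centroid per row)
  dists <- apply(centroids, 1, function(y) eucDist(y, newData))
  ## Index of the nearest centroid, i.e. the cluster label
  unname(which.min(dists))
}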

> classifyNewSample(c(7,3,6,2))
[1] 1
> classifyNewSample(c(6,2.7,4.3,1.4))
[1] 2

As far as model persistence goes, check out ?save (or ?saveRDS, which stores a single object).
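
A minimal sketch of the persistence step, assuming the single-object saveRDS/readRDS pair rather than save/load. The iris fit, the fixed seed, and the temporary file path are placeholders chosen for illustration, not part of the original question:

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
model <- kmeans(iris[, 1:4], 3)

# Assumption: a temporary file stands in for wherever you keep the model.
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# Later, possibly in a fresh R session, read the fitted object back.
restored <- readRDS(model_path)

# The restored object carries the same centroids as the original fit.
same_centers <- identical(restored$centers, model$centers)
```

Since classifying new samples only needs the centroids, storing model$centers alone would also be enough if the rest of the fit is not needed.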

Edit:

kmeans has no predict method of its own, so to label every row of a new matrix, apply the classifying function row by row:

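## I'm just generating a random 50x4 matrix here, with values drawn from 0, 0.1, ..., 10.
## (The names n_rows/n_cols avoid masking the base function c().)
n_rows <- 50
n_cols <- 4
new_data <- matrix(sample(seq(0, 10, 0.1), n_rows * n_cols, replace = TRUE), n_rows, n_cols)
new_labels <- apply(new_data, 1, classifyNewSample)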

> new_labels
[1] 1 2 3 3 2 1 3 1 3 1 2 3 3 1 1 3 1 1 1 3 1 1 1 1 1 1 3 1 1 3 3 1 1 3 2 1 3 2 3 1 2 1 2 1 1 2 1 3 2 1
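
For larger data sets, the row-by-row approach can be replaced by one that computes the whole distance matrix first. The following is only a sketch under stated assumptions: assign_clusters is a hypothetical helper name, the seed is arbitrary, and squared Euclidean distance is used because it yields the same nearest centroid as the plain distance. As a sanity check, it also relabels the training data, which should reproduce the labels stored in the fit once kmeans has converged:

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
model <- kmeans(iris[, 1:4], 3)

# Hypothetical helper: label each row of new_data with its nearest centroid.
assign_clusters <- function(new_data, centroids) {
  new_data <- as.matrix(new_data)
  # Column j holds the squared distance from every row to centroid j.
  dists <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(new_data, 2, centroids[j, ])^2)
  })
  # sapply drops to a plain vector for a single row, so restore the matrix shape.
  dists <- matrix(dists, nrow = nrow(new_data))
  # The largest negated distance is the smallest distance.
  max.col(-dists, ties.method = "first")
}

# Two new samples, one per row.
new_data <- rbind(c(7, 3, 6, 2), c(5, 3.5, 1.4, 0.2))
labels <- assign_clusters(new_data, model$centers)

# Sanity check: relabel the training data and compare with the fitted labels.
train_labels <- assign_clusters(iris[, 1:4], model$centers)
matches_fit <- all(train_labels == model$cluster)
```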