I've built a cluster model in R (kmeans):
fit <- kmeans(sales_DP_DCT_agg_tr_bel_mod, 4)
Now I want to use this model and segment a brand new data set. How can I:
- store the model
- run the model on a new data set?
Let's say you're using iris as a dataset.
data = iris[,1:4] ## Don't want the categorical feature
model = kmeans(data, 3)
Here's what the output looks like:
> model
K-means clustering with 3 clusters of sizes 96, 33, 21
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.314583 2.895833 4.973958 1.7031250
2 5.175758 3.624242 1.472727 0.2727273
3 4.738095 2.904762 1.790476 0.3523810
Clustering vector:
[1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Within cluster sum of squares by cluster:
[1] 118.651875 6.432121 17.669524
(between_SS / total_SS = 79.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
Notice that you have access to the centroids through model$centers. To classify an incoming sample, all you have to do is find the centroid it is closest to. You could define a Euclidean distance function as follows:
eucDist <- function(x, y) sqrt(sum( (x-y)^2 ))
And then a classifying function:
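classifyNewSample <- function(newData, centroids = model$centers) {
  ## distance from the new sample to each centroid (one centroid per row)
  dists = apply(centroids, 1, function(y) eucDist(y,newData))
  ## index of the closest centroid
  order(dists)[1]
}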
> classifyNewSample(c(7,3,6,2))
[1] 1
> classifyNewSample(c(6,2.7,4.3,1.4))
[1] 1
As far as model persistence goes, check out ?save.
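As a minimal sketch of the persistence step, the fitted object can be written to disk and read back later. This uses saveRDS() and readRDS() from base R, which are an alternative to the save() and load() pair documented under ?save; the file name and the variable names model_path and restored are placeholders chosen for illustration, not part of the original answer.

```r
# Fit on the four numeric iris columns, as in the example above.
data <- iris[, 1:4]
model <- kmeans(data, 3)

# Hypothetical file location; replace with a path of your own.
model_path <- file.path(tempdir(), "kmeans_model.rds")

# Write the fitted kmeans object to disk ...
saveRDS(model, model_path)

# ... and, in a later session, read it back under any name you like.
restored <- readRDS(model_path)

# The centroids survive the round trip, so they can be used for classification.
all.equal(model$centers, restored$centers)
```

With save(model, file = ...) and load(...) the object comes back under its original name instead, which may or may not be what you want when several models are kept around.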
To apply the classifying function to each row of a new matrix:
## I'm just generating a random 50x4 matrix here:
n_rows <- 50
n_cols <- 4  ## named n_cols rather than c, so the built-in c() is not shadowed
m0 <- matrix(0, n_rows, n_cols)
new_data = apply(m0, c(1,2), function(x) sample(seq(0,10,0.1),1))
new_labels = apply(new_data, 1, classifyNewSample)
> new_labels
[1] 1 2 3 3 2 1 3 1 3 1 2 3 3 1 1 3 1 1 1 3 1 1 1 1 1 1 3 1 1 3 3 1 1 3 2 1 3 2 3 1 2 1 2 1 1 2 1 3 2 1
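For larger data sets, the row-by-row apply above can be replaced by computing the distances to each centroid for all rows at once. The following is only a sketch under that assumption: classifyNewData is a hypothetical helper name, it expects a matrix or data frame with one sample per row (not a bare vector), and it compares squared distances, which gives the same nearest centroid as the Euclidean distance.

```r
# Hypothetical helper: label every row of newData with its nearest centroid.
classifyNewData <- function(newData, centroids) {
  newData <- as.matrix(newData)
  # Column k holds the squared Euclidean distance from every row to centroid k.
  d2 <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(newData, 2, centroids[k, ])^2)
  })
  # Force an n x k matrix so that a single-row input is handled too.
  d2 <- matrix(d2, nrow = nrow(newData))
  # Index of the smallest distance in each row.
  apply(d2, 1, which.min)
}

model <- kmeans(iris[, 1:4], 3)
new_data <- matrix(sample(seq(0, 10, 0.1), 50 * 4, replace = TRUE), nrow = 50)
new_labels <- classifyNewData(new_data, model$centers)
table(new_labels)
```

Passing the centroids explicitly, rather than defaulting to a global model, makes it easier to reuse the function with a model that was loaded from disk.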