
I'm messing around with machine learning, and I've written a k-means implementation in Python. It takes two-dimensional data and organises the points into clusters. Each data point also has a class value of either 0 or 1.

What confuses me about the algorithm is how I can then use it to predict values for another set of two-dimensional data that doesn't have a 0 or a 1, but is instead unknown. For each cluster, should I average the class values of the points within it to either a 0 or a 1, so that if an unknown point is closest to that cluster, it takes on the averaged value? Or is there a smarter method?

Cheers!

DizzyDoo

4 Answers


To assign a new data point to one of a set of clusters created by k-means, you just find the centroid nearest to that point.

In other words, use the same steps you used for the iterative assignment of each point in your original data set to one of the k clusters. The only difference here is that the centroids you use for this computation are the final set--i.e., the centroid values from the last iteration.

Here's one implementation in Python (with NumPy):

>>> import numpy as NP
>>> # just made up values--based on your spec (2D data + 2 clusters)
>>> centroids
      array([[54, 85],
             [99, 78]])

>>> # randomly generate a new data point within the problem domain:
>>> new_data = NP.array([67, 78])

>>> # to assign a new data point to a cluster ID,
>>> # find its closest centroid:
>>> diff = centroids - new_data  # NumPy broadcasting
>>> diff
      array([[-13,   7],
             [ 32,   0]])

>>> dist = NP.sqrt(NP.sum(diff**2, axis=-1))  # Euclidean distance
>>> dist
      array([ 14.76,  32.  ])

>>> closest_centroid = centroids[NP.argmin(dist)]
>>> closest_centroid
       array([54, 85])
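The same lookup extends to a whole batch of new points at once via broadcasting; a sketch with made-up points, reusing the centroids above:

```python
import numpy as np

centroids = np.array([[54, 85],
                      [99, 78]])

# a few hypothetical new points in the same 2-D domain
new_points = np.array([[67, 78],
                       [95, 80],
                       [50, 90]])

# pairwise differences, shape (n_points, n_centroids, 2)
diff = new_points[:, None, :] - centroids[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distances

labels = dist.argmin(axis=1)  # nearest-centroid index per point
print(labels)  # -> [0 1 0]
```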
doug

I know that I might be late, but here is my general solution to your problem:

import numpy as np

def predict(data, centroids):
    centroids, data = np.array(centroids), np.array(data)
    distances = []
    for unit in data:
        for center in centroids:
            distances.append(np.sum((unit - center) ** 2))
    # one row per data point, one column per centroid
    distances = np.reshape(distances, (len(data), len(centroids)))
    closest_centroid = [np.argmin(dist) for dist in distances]
    return closest_centroid
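For what it's worth, scikit-learn's KMeans exposes this nearest-centroid lookup directly as its predict method; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious blobs (made-up values)
X = np.array([[1.0, 1.0], [1.0, 2.0], [10.0, 10.0], [10.0, 11.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# new, unlabeled points are assigned to the nearest learned centroid
labels = km.predict(np.array([[0.0, 0.0], [12.0, 12.0]]))
```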
alghul96

If you are considering assigning a value based on the average value within the nearest cluster, you are describing a form of "soft decoder", which estimates not only the correct value of the coordinate but also your level of confidence in the estimate. The alternative is a "hard decoder", where only the values 0 and 1 are legal (i.e., occur in the training data set), and a new coordinate gets the median of the values within the nearest cluster. My guess is that you should always assign only a known-valid class value (0 or 1) to each coordinate; averaging class values is not a valid approach.
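The "hard decoder" idea can be sketched as a majority vote over each cluster's training labels (all values below are made up):

```python
import numpy as np

# hypothetical training set: cluster assignment and known 0/1 class per point
cluster_ids = np.array([0, 0, 0, 1, 1, 1])
classes = np.array([0, 1, 0, 1, 1, 0])

# majority class per cluster -- always a legal 0/1 value, never an average
cluster_label = {}
for c in np.unique(cluster_ids):
    members = classes[cluster_ids == c]
    cluster_label[int(c)] = int(np.bincount(members).argmax())

# a new point assigned to cluster 1 then inherits cluster_label[1]
print(cluster_label)  # -> {0: 0, 1: 1}
```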

Dave

This is how I assigned labels to the closest existing centroid. It can also be useful for online/incremental clustering: new points are assigned to the existing clusters while the centroids stay fixed. Be careful, though: after (let's say) 5-10% new points, you might want to recalculate the centroid coordinates.

import numpy as np
import pandas as pd

def Labs(dataset, centroids):
    a = []
    for i in range(len(dataset)):
        d = []
        for j in range(len(centroids)):
            dist = np.linalg.norm(dataset[i, :] - centroids[j, :])
            d.append(dist)
        a.append(np.argmin(d))
    return pd.DataFrame(np.array(a) + 1, columns=['Lab'])
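The centroid recalculation mentioned above doesn't require a full re-run; one option is an incremental (running-mean) update as each new point is folded in. A sketch with assumed per-cluster counts:

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.array([4, 4])  # assumed number of points already in each cluster

new_point = np.array([2.0, 0.0])
j = np.argmin(np.linalg.norm(centroids - new_point, axis=1))  # nearest cluster

# fold the point into the running mean for cluster j
counts[j] += 1
centroids[j] += (new_point - centroids[j]) / counts[j]
# centroid 0 moves from [0.0, 0.0] to [0.4, 0.0]
```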

I hope it helps.

Javi
  • Please include some explanation of what you have here, and why it should or could fix OP's problem. This will not help him learn from his mistakes. – Granny Jan 31 '18 at 14:04