I have a working example of Mean Shift clustering using pandas and scikit-learn. I am new to Python, so I think I am missing something basic here. Here is my working code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import MeanShift
from matplotlib import style
style.use("ggplot")


filepath = "./Probes1.xlsx"
X = pd.read_excel(filepath, usecols="B:I", header=1)
df=pd.DataFrame(data=X)
np_array = df.values
print(np_array)

ms=MeanShift()
ms.fit(np_array)

labels= ms.labels_
cluster_centers = ms.cluster_centers_
print("cluster centers:")
print(cluster_centers)

labels_unique = np.unique(labels)
n_clusters_=len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
#colors = 10*['r.','g.','b.','c.','k.','y.','m.']

for i in range(len(np_array)):
    plt.scatter(np_array[i][0], np_array[i][1], edgecolors='face' )
plt.scatter(cluster_centers[:,0],cluster_centers[:,1],c='b',
   marker = "x", s = 20, linewidths = 5, zorder = 10)
plt.show()

Here is the plot that I get from this code:

Plot

However, the colors of the cluster centers do not match the colors of their data points. Currently I have set the center color to blue ('b'). Any help would be appreciated. Thank you!

EDIT: I was able to create this! 2D plot with perfect colors

EDIT 2:

from itertools import cycle
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


filepath = "./Probes1.xlsx"
X = pd.read_excel(filepath, usecols="B:I", header=1)  # import Excel data (already a DataFrame)
df = pd.DataFrame(data=X)  # wraps the same data in a DataFrame again
np_array = df.values  # NumPy array of the DataFrame values
print(np_array)  # print the array

ms = MeanShift()
ms.fit(X) #Clustering
labels=ms.labels_
cluster_centers = ms.cluster_centers_ #coordinates of cluster centers
print("cluster centers:")
print(cluster_centers)

labels_unique = np.unique(labels)
n_clusters_=len(labels_unique) #no. of clusters
print("number of estimated clusters : %d" % n_clusters_)

# ################################# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

colors=cycle('bgrkmycbgrkmycbgrkmycbgrkyc')
for k, col in zip(range(n_clusters_), colors):
    my_members= labels == k
    cluster_center = cluster_centers[k]
    ax.scatter(np_array[my_members, 0], np_array[my_members, 1], np_array[my_members, 2], col + '.')
    ax.scatter(cluster_centers[:,0], cluster_centers[:,1], cluster_centers[:,2], marker='o', s=300, linewidth=1, zorder=0)
    print(col) #prints b g r k in the respective iterations
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.grid()
plt.show()

Plots this: 3d plot

Again the colors do not match. Is there an alternative to the 'markerfacecolor' argument of plt.plot for scatter plots, so that I can match the colors of the cluster centers with their data points?

EDIT 3: Got the required results: Final3d-Plot

Sam

1 Answer

You're setting your cluster center color to blue with c='b':

plt.scatter(cluster_centers[:,0], cluster_centers[:,1], c='b', marker='x', s=20, linewidths=5, zorder=10)

To match the colors of both scatters, you would have to specify them for both.
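As a minimal sketch of what "specifying them for both" could look like: loop over the clusters rather than over the individual rows, pick one color per cluster, and pass it as c= to both scatter calls. The data below is a synthetic stand-in (an assumption, not the Excel file from the question); in the real script, labels and cluster_centers would come from ms.labels_ and ms.cluster_centers_.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins (assumptions) for np_array, ms.labels_ and ms.cluster_centers_.
rng = np.random.default_rng(0)
cluster_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 20)
np_array = cluster_centers[labels] + rng.normal(scale=0.5, size=(60, 2))

colors = ["r", "g", "b", "c", "k", "y", "m"]
fig, ax = plt.subplots()
for k, center in enumerate(cluster_centers):
    col = colors[k % len(colors)]  # one color per cluster, reused if clusters > colors
    members = labels == k          # boolean mask selecting the points of cluster k
    # Data points of cluster k
    ax.scatter(np_array[members, 0], np_array[members, 1], c=col, marker=".")
    # Center of cluster k, drawn in the same color
    ax.scatter(center[0], center[1], c=col, marker="x", s=100, linewidths=3, zorder=10)

# plt.show()  # uncomment when running interactively
```

Because the mask labels == k ties each point to its cluster index, the list of colors only needs one entry per cluster, not one per row.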

ybnd
  • Exactly. If I remove it (I tried), the centers automatically get a somewhat greyish color, which I think is the default value for 'c'. – Sam Jun 12 '20 at 10:38
  • Does specifying them for both mean I need a list of 4 colors? Since I have 4 clusters in my dataset, I would need a list of 4 colors for both, right? Is there a way to match the data points to their corresponding cluster center so they get the same color, let's say index 0 in my list of colors? – Sam Jun 12 '20 at 10:41
  • Assuming `np_array` contains your clusters, you can set the color of each cluster scatter as `c=color[i]` in your for loop, and the colors of the cluster centers as `c=color`. – ybnd Jun 12 '20 at 11:43
  • I tried it out. I think the problem is that `np_array` is just the DataFrame and has no connection to the clusters. Your solution gives me the following error: File "C:/Users/Q499593/PycharmProjects/MachineLearning/Clustering.py", line 30, in plt.scatter(np_array[i][0], np_array[i][1], c=color[i], edgecolors='face' ) IndexError: list index out of range. That makes sense, because my DataFrame has 73 elements and colors only 4. Is there a way to directly plot the data points of a cluster in the same color? – Sam Jun 15 '20 at 09:19
  • Right. Then you're actually plotting every entry in your DataFrame to a separate scatter plot, which probably isn't what you want. – ybnd Jun 15 '20 at 10:51
  • Take a look at the for loop [in this example](https://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html). If you iterate over your clusters and colors, you can filter your DataFrame to keep the labeled points. – ybnd Jun 15 '20 at 10:54
  • Perfect! I have been looking into it since Sunday and now it finally works! (See 'EDIT' in the question.) Is there a way to plot exactly the same graph (with respect to colors) but in 3D? I have looked into `scatter` and `add_subplot` with a 3D projection. Am I heading in the right direction? – Sam Jun 15 '20 at 11:14
  • Yeah, there's a lot of info to be found about 3d plotting with `matplotlib`, so you should be fine. – ybnd Jun 15 '20 at 12:39
  • Thank you so much for your help. It's great to learn new plotting techniques, but now I am facing a similar problem with the 3D graph. Since scatter has no parameter like 'markerfacecolor', could you point me toward a workaround or possibly a solution? I have attached the new screenshot under 'EDIT 2' of my question. Even if I set `c=col`, the color of the cluster centers remains the same as in the screenshot. Any help would be appreciated! – Sam Jun 15 '20 at 14:57
  • EDIT 3 shows what I wanted if any other user consults this question :) Thank you for your help! @ybnd – Sam Jun 16 '20 at 13:16
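
The code behind EDIT 3 is not shown in the thread. As a rough sketch of one way to get matching colors in 3D (not necessarily what was actually used): the fourth positional argument of the 3D scatter is zdir, not a format string, so a value like col + '.' is not read as a color there. Passing c=col and marker= as keywords, and drawing only the center of cluster k inside the loop, should give each center the color of its points. The data is again a synthetic stand-in for np_array, ms.labels_ and ms.cluster_centers_.

```python
from itertools import cycle

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection on older matplotlib)

# Synthetic stand-ins (assumptions) for np_array, ms.labels_ and ms.cluster_centers_.
rng = np.random.default_rng(1)
cluster_centers = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 0.0], [0.0, 4.0, 4.0], [4.0, 0.0, 4.0]])
labels = np.repeat(np.arange(4), 15)
np_array = cluster_centers[labels] + rng.normal(scale=0.4, size=(60, 3))
n_clusters = len(np.unique(labels))

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

for k, col in zip(range(n_clusters), cycle("bgrkmyc")):
    members = labels == k                # points assigned to cluster k
    center = cluster_centers[k:k + 1]    # shape (1, 3): only this cluster's center
    # Color and marker are passed as keywords, never as a positional format string.
    ax.scatter(np_array[members, 0], np_array[members, 1], np_array[members, 2],
               c=col, marker=".")
    ax.scatter(center[:, 0], center[:, 1], center[:, 2],
               c=col, marker="o", s=300)

ax.set_title("Estimated number of clusters: %d" % n_clusters)
# plt.show()  # uncomment when running interactively
```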