
I am trying to run K-Means in Python on aggregated data files. For example, instead of a data frame in which 3 identical records occupy 3 rows, a single row represents all 3, and a count column such as cnt (arbitrarily named) holds the number 3.
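To make the aggregated layout concrete, here is a minimal sketch of how duplicate rows collapse into one row plus a count. The toy values and the column name cnt are assumptions for illustration only and are not taken from the wholesale data:

```python
import pandas as pd

# Hypothetical toy data: three identical records plus one distinct record.
raw = pd.DataFrame({
    "Fresh": [10, 10, 10, 7],
    "Milk": [5, 5, 5, 2],
})

# Collapse duplicate rows into one row each; `cnt` holds how many
# original rows each aggregated row stands for.
aggregated = (
    raw.groupby(list(raw.columns))
       .size()
       .reset_index(name="cnt")
)

print(aggregated)
```

The counts in cnt sum back to the number of rows in the unaggregated frame, so no information about the records is lost.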

Below is some basic starter code that does NOT use the aggregated representation of the rows. Please let me know if you would like me to post the .csv too, but it should be pretty basic:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

data = pd.read_csv('../Data/wholesale_data.csv')
data.head()

categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen',
                       'Detergents_Paper', 'Delicassen']

for col in categorical_features: #for each categorical col
    dummies = pd.get_dummies(data[col], prefix=col) #one-hot-encoding
    data = pd.concat([data, dummies], axis=1) #append to data
    data.drop(col, axis=1, inplace=True) #drop orig column
data.head()

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)

sum_of_squared_distances = []

K = range(1,15)

for k in K:
    km = KMeans(n_clusters=k) #init model
    km = km.fit(data_transformed) #fit model
    sum_of_squared_distances.append(km.inertia_) #overall SSE 


plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Mayank Patel
Zach
  • And what is your question? – G. Anderson Apr 18 '19 at 21:50
  • What is the correct code to use? – Zach Apr 18 '19 at 21:51
  • Please see the following link on providing a [mcve]. It's unclear from your question what you've tried, what was wrong with your effort, and what your desired output is. If the code you posted doesn't work, _why_ doesn't it work? Error? Wrong output? And if it's code for a different problem, then where's the code you've tried for the current use case? – G. Anderson Apr 18 '19 at 21:55
  • I suppose my post is not so much a question as a plea to understand the proper way to code my data. Perhaps this is NOT the right forum for helping folks understand syntax intricacies, as opposed to answering questions. I have an example worked out, and it is based on the code above. The reference I am using is https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, where fit(X[, y, sample_weight]), "Compute k-means clustering.", is what I am trying to accomplish. – Zach Apr 22 '19 at 15:13
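For reference, a minimal sketch of what passing the counts through sample_weight could look like. It assumes a hypothetical aggregated frame with already-scaled features and a cnt column; the values are made up, and this is not a worked solution for the wholesale data. The expanded fit is included only to check that the two representations agree on this toy input:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical aggregated data: each row stands for `cnt` identical records.
agg = pd.DataFrame({
    "Fresh": [0.0, 0.1, 0.9, 1.0],
    "Milk": [0.0, 0.2, 0.8, 1.0],
    "cnt": [3, 1, 2, 4],
})

features = ["Fresh", "Milk"]          # `cnt` is a weight, not a feature
X = agg[features].to_numpy()
weights = agg["cnt"].to_numpy()

# Fit on the aggregated rows, weighting each row by its count.
km_weighted = KMeans(n_clusters=2, n_init=10, random_state=0)
km_weighted.fit(X, sample_weight=weights)

# Cross-check: expand the rows back out and fit without weights.
X_expanded = np.repeat(X, weights, axis=0)
km_expanded = KMeans(n_clusters=2, n_init=10, random_state=0)
km_expanded.fit(X_expanded)

def sorted_centers(model):
    """Order cluster centers by first coordinate so two fits can be compared."""
    centers = model.cluster_centers_
    return centers[np.argsort(centers[:, 0])]

print(sorted_centers(km_weighted))
print(sorted_centers(km_expanded))
print(km_weighted.inertia_, km_expanded.inertia_)
```

In the elbow loop from the question, the same idea would presumably mean calling km.fit(data_transformed, sample_weight=...) with the count column removed from the features before scaling. Min-max scaling depends only on each column's minimum and maximum, so duplicated rows should not change it.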

0 Answers