scikit-learn KMeans and PCA dimensionality reduction
I have a dataset of about 2M rows with 7 measurement columns of home power consumption, plus a date for each measurement:
- date,
- Global_active_power,
- Global_reactive_power,
- Voltage,
- Global_intensity,
- Sub_metering_1,
- Sub_metering_2,
- Sub_metering_3
I load the dataset into a pandas DataFrame, select every column except the date column, and then split off a small sample with train_test_split.
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()
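For reference, here is a minimal sketch of the same split step on a toy frame, assuming the DataFrame itself (rather than `.values`) is passed to `train_test_split`. The toy frame, its values, and the names `toy`, `train_df`, `rest_df`, and `recovered_dates` are made up for illustration; the point is only that each part keeps its original row labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real file (values are made up).
toy = pd.DataFrame({
    "Date": pd.date_range("2007-01-01", periods=20, freq="D"),
    "Voltage": np.linspace(230.0, 249.0, 20),
    "Global_intensity": np.arange(20, dtype=float),
})

# Split the DataFrame itself (not .values) so each part keeps its row labels.
train_df, rest_df = train_test_split(toy, train_size=0.25, random_state=0)

# The surviving index values point back at rows of the original frame.
recovered_dates = toy.loc[train_df.index, "Date"]
```

With a bare NumPy array, as in my code above, those row labels are gone after the split.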
I reduce the sample to two dimensions with PCA, run K-means clustering on the reduced data, and plot the result.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Now I would like to find out which rows fell into a given cluster, and from that which dates fell into that cluster.
- Is there any way to relate the points on the graph to an index in my dataset, after PCA?
- Some method I don't know of?
- Or is my approach fundamentally flawed?
- Any recommendations?
I am fairly new to this field and am trying to read through lots of code. The code above is a compilation of several documented examples I have seen.
My goal is to classify the data and then get the dates that fall under a class.
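To make the goal concrete, here is a rough sketch of the kind of mapping I am after, on synthetic data rather than the real file. It assumes that `PCA.fit_transform` returns rows in the same order as its input and that `KMeans.labels_` follows that same order, so the labels can be attached back to a DataFrame that still holds the date column. The frame, its values, the choice of 3 clusters, and the names `sample`, `labelled`, and `dates_in_cluster_0` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "Date": pd.date_range("2007-01-01", periods=200, freq="D"),
    "Global_active_power": rng.rand(200),
    "Voltage": 240 + rng.randn(200),
    "Global_intensity": rng.rand(200) * 10,
})

# Sample rows while keeping the DataFrame index and the Date column.
sample = df.sample(frac=0.5, random_state=0)
features = sample.drop(columns="Date")

# Reduce to 2 dimensions, then cluster the reduced points.
reduced = PCA(n_components=2).fit_transform(features.values)
k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(reduced)

# labels_[i] belongs to row i of `reduced`, which is row i of `sample`.
labelled = sample.assign(cluster=k_means.labels_)
dates_in_cluster_0 = labelled.loc[labelled["cluster"] == 0, "Date"]
```

If that assumption holds, `labelled.index` would also give the row positions in the original frame.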
Thank you.