I have a dataset in which I will be using only a single column to apply k-means clustering. However, while plotting the graph, I am getting a "numpy.ndarray" error. I tried converting to float, but I am still facing the same issue.

Dataframe:

 Brim
 1234.5
 345
 675.7
 120
 110

Code:

 from sklearn.cluster import KMeans
 import numpy as np
 km = KMeans(n_clusters=4, init='k-means++',n_init=10)
 km.fit(df1)
 x = km.fit_predict(df1)
 x
 array([0, 0, 0, ..., 3, 3, 3])

 np.shape(x)
 (1097,)

 import matplotlib.pyplot as plt
 %matplotlib inline

 plt.scatter(df1[x ==1,0], df1[x == 0,1], s=100, c='red')
 plt.scatter(df1[x ==1,0], df1[x == 1,1], s=100, c='black')
 plt.scatter(df1[x ==2,0], df1[x == 2,1], s=100, c='blue')
 plt.scatter(df1[x ==3,0], df1[x == 3,1], s=100, c='cyan')

Error:

   ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-62-5f0966ccc828> in <module>()
     1 import matplotlib.pyplot as plt
     2 get_ipython().run_line_magic('matplotlib', 'inline')
  ----> 3 plt.scatter(df1[x ==1,0], df1[x == 0,1], s=100, c='red')
     4 plt.scatter(df1[x ==1,0], df1[x == 1,1], s=100, c='black')
     5 plt.scatter(df1[x ==2,0], df1[x == 2,1], s=100, c='blue')

     ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
     2137             return self._getitem_multilevel(key)
     2138         else:
   ->2139             return self._getitem_column(key)
     2140 
     2141     def _getitem_column(self, key):

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
     2144         # get column
     2145         if self.columns.is_unique:
  -> 2146             return self._get_item_cache(key)
     2147 
     2148         # duplicate columns & possible reduce dimensionality

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
     1838         """Return the cached item, item represents a label indexer."""
     1839         cache = self._item_cache
  -> 1840         res = cache.get(item)
     1841         if res is None:
     1842             values = self._data.get(item)

   TypeError: unhashable type: 'numpy.ndarray'
anagha s

2 Answers

If I understood your code correctly, you're trying to slice your DataFrame for plotting, based on the values of x. For that, you should be using df1.loc[x==1,0] instead of df1[x==1,0] (and so on for all other slices).
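
A minimal sketch of this kind of slicing, assuming a single column named `Brim` as in the question and a made-up label array in place of the real `fit_predict` output. Note that `.loc` is label-based, so a `0` in the column position only works if a column is actually labelled `0`; otherwise the column name (with `.loc`) or its position (with `.iloc`) is needed.

```python
import numpy as np
import pandas as pd

# Sample values taken from the question's "Brim" column.
df1 = pd.DataFrame({"Brim": [1234.5, 345, 675.7, 120, 110]})

# Hypothetical cluster labels standing in for km.fit_predict(df1).
x = np.array([0, 1, 2, 3, 3])

mask = x == 3  # boolean NumPy array, one entry per row

# Label-based: rows chosen by the mask, column chosen by its name.
by_label = df1.loc[mask, "Brim"]

# Position-based: rows chosen by the mask, first column chosen by position.
by_position = df1.iloc[mask, 0]

print(by_label.tolist())     # [120.0, 110.0]
print(by_position.tolist())  # [120.0, 110.0]
```

With only one column there is no second column to use as the y axis, so one option is to plot each cluster against a constant, for example `plt.scatter(by_position, np.zeros(len(by_position)))`.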

Asmus
  • Still the error , TypeError: cannot do label indexing on with these indexers [0] of – anagha s May 19 '19 at 18:55
  • I guess it's because there is only one variable. How do I plot a univariate graph in this case? – anagha s May 19 '19 at 18:56
  • @anaghas that (new?) error sounds like your DataFrame has a "string" index (instead of e.g. 0,1,2,...), so you cannot use `x=[0,0,0,..3,3,]` (which is of type `int64`) as a mask here. What are the returns of `print(df1.index.dtype)` and `print(x.dtype)`? – Asmus May 19 '19 at 20:17
  • I'm running into the same problem. `print(sample[y_kmeans == 0])` works fine i.e. filters the rows correctly with the value == 0. However, `print(sample[y_kmeans == 0, 0])` throws below error `TypeError: '(array([False, False, False, False, False, False, False, False, False,True, True]), 0)' is an invalid key` I tried `print(sample[y_kmeans == 0, True])`, this throws same error as well. Any suggestions? Please – Sachin G Dec 12 '21 at 04:25

In my case, I was trying to pick 2 random features and run a KMeans classifier on them.

sample = df[['f1','f2','f3','f4','f5','f6','f7']].sample(2, axis=1) # select 2 random features
kmeans_classifier = KMeans(n_clusters=3)
y_kmeans = kmeans_classifier.fit_predict(sample)
plt.scatter(sample[y_kmeans == 0, 0], sample[y_kmeans == 0, 1], s = 75, c ='red', label = 'Zero') # raises the TypeError

The last line was throwing the TypeError. I resolved this by converting the sample DataFrame to its NumPy representation with `.values`.

Modified code:

sample = df[['f1','f2','f3','f4','f5','f6','f7']].sample(2, axis=1).values
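
A short sketch of why the conversion helps, using a made-up two-feature frame and made-up labels rather than the real data: a NumPy array accepts a (row mask, column index) pair directly, while a DataFrame treats the same pair as a single key. `to_numpy()` is used here and returns the same data as `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-feature frame standing in for the two sampled columns.
sample = pd.DataFrame({"f1": [1.0, 2.0, 9.0, 10.0],
                       "f2": [1.5, 2.5, 9.5, 10.5]})

# Hypothetical labels standing in for kmeans_classifier.fit_predict(sample).
y_kmeans = np.array([0, 0, 1, 1])

X = sample.to_numpy()  # plain 2-D array with the same values as sample.values

# Rows of cluster 0, split into the two feature columns.
xs = X[y_kmeans == 0, 0]
ys = X[y_kmeans == 0, 1]

print(xs.tolist())  # [1.0, 2.0]
print(ys.tolist())  # [1.5, 2.5]
```

The arrays `xs` and `ys` can then be passed to `plt.scatter` as in the original line.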
Sachin G