I'm trying to do a K-means analysis in a dataframe like this:

    URBAN AREA  PROVINCE    DENSITY
0   1          TRUJILLO     0.30
1   2          TRUJILLO     0.03
2   3          TRUJILLO     0.80
3   1          LIMA         1.20
4   2          LIMA         0.04
5   1          LAMBAYEQUE   0.90
6   2          LAMBAYEQUE   0.10
7   3          LAMBAYEQUE   0.08

(You can download it from here)

As you can see, the df contains different urban areas (each with its own urban density value) within provinces. I want to run the K-means classification on a single column, DENSITY. To do so, I execute this code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df=pd.read_csv('C:/Path/to/example.csv')

clustering=KMeans(n_clusters=2, max_iter=300)
clustering.fit(df[['DENSITY']])

df['KMeans_Clusters']=clustering.labels_
df

And I get this result, which is OK for this first part of the example:

    URBAN AREA  PROVINCE    DENSITY     KMeans_Clusters
0   1           TRUJILLO       0.30     0
1   2           TRUJILLO       0.03     0
2   3           TRUJILLO       0.80     1
3   1           LIMA           1.20     1
4   2           LIMA           0.04     0
5   1           LAMBAYEQUE     0.90     1
6   2           LAMBAYEQUE     0.10     0
7   3           LAMBAYEQUE     0.08     0

But now I want to do the K-means classification of urban areas by province, that is, to repeat the same process within each province. So I tried this code:

df=pd.read_csv('C:/Users/rojas/Desktop/example.csv')

clustering=KMeans(n_clusters=2, max_iter=300)

clustering.fit(df[['DENSITY']]).groupby('PROVINCE')

df['KMeans_Clusters']=clustering.labels_
df

but I get this message:

AttributeError                            Traceback (most recent call last)
<ipython-input-4-87e7696ff61a> in <module>
      3 clustering=KMeans(n_clusters=2, max_iter=300)
      4 
----> 5 clustering.fit(df[['DENSITY']]).groupby('PROVINCE')
      6 
      7 df['KMeans_Clusters']=clustering.labels_

AttributeError: 'KMeans' object has no attribute 'groupby'

Is there a way to do so?

José Rojas
  • Try `clustering.fit(df.groupby('PROVINCE')['DENSITY'])` instead. – David Lee Jul 03 '21 at 00:57
  • @DavidLee I get this: `ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.` – José Rojas Jul 03 '21 at 01:09

1 Answer

Try this:

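def k_means(row):
    # `row` is really one group: the sub-DataFrame holding a single province's rows
    row = row.copy()  # work on a copy rather than a slice of the original DataFrame
    clustering = KMeans(n_clusters=2, max_iter=300)
    model = clustering.fit(row[['DENSITY']])
    row['KMeans_Clusters'] = model.labels_
    return row

# Fit a separate K-means model for each province.
# group_keys=False keeps the original row index instead of adding PROVINCE as an index level.
df = df.groupby('PROVINCE', group_keys=False).apply(k_means)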

Result:

   URBAN  AREA  PROVINCE    DENSITY  KMeans_Clusters
0  0      1     TRUJILLO    0.30     0
1  1      2     TRUJILLO    0.03     0
2  2      3     TRUJILLO    0.80     1
3  3      1     LIMA        1.20     1
4  4      2     LIMA        0.04     0
5  5      1     LAMBAYEQUE  0.90     0
6  6      2     LAMBAYEQUE  0.10     1
7  7      3     LAMBAYEQUE  0.08     1
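Note that in the result above the cluster numbers are arbitrary within each province: the densest area is labelled 1 in TRUJILLO and LIMA but 0 in LAMBAYEQUE. If comparable labels matter, one possible variation (a minimal sketch, not part of the original answer) is to rank the clusters by their centroid so that 0 always means the lower-density cluster. The helper name `cluster_density` and the `n_init=10` / `random_state=0` settings are assumptions made for this illustration, and the DataFrame is rebuilt inline from the question's table instead of being read from the CSV.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Sample data copied from the question (stands in for example.csv)
df = pd.DataFrame({
    "URBAN AREA": [1, 2, 3, 1, 2, 1, 2, 3],
    "PROVINCE": ["TRUJILLO"] * 3 + ["LIMA"] * 2 + ["LAMBAYEQUE"] * 3,
    "DENSITY": [0.30, 0.03, 0.80, 1.20, 0.04, 0.90, 0.10, 0.08],
})

def cluster_density(density):
    """Cluster one province's densities; label 0 = lower-density cluster."""
    # n_init and random_state are illustrative choices, not from the original answer
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(density.to_frame())
    # rank[i] is the position of cluster i when centroids are sorted ascending
    rank = km.cluster_centers_.ravel().argsort().argsort()
    return pd.Series(rank[labels], index=density.index)

# transform() calls the function once per province and realigns to df's index
df["KMeans_Clusters"] = (
    df.groupby("PROVINCE")["DENSITY"].transform(cluster_density).astype(int)
)
print(df)
```

Every province needs at least as many rows as `n_clusters`; otherwise `KMeans` raises an error for that group.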

fthomson
  • Thanks, it works. If it's not too much of a hassle, could you explain `def` and `return`? I also don't know what `row` does (in the first line of the code). – José Rojas Jul 03 '21 at 01:16
  • So in this example I created my own function and applied it to your dataframe. The issue you were having is that you were trying to call the `groupby` method on an object that is not a dataframe. To solve that, you first group the dataframe by "PROVINCE" and then fit the model by applying a function to each group. When you apply a function to a grouped dataframe, each group is passed into the function individually. You can see this by putting `print(row)` inside the function `k_means`. – fthomson Jul 03 '21 at 01:24
  • `def` is short for "define" and is the keyword used to define a function. `return` specifies the value you want the function to give back. – fthomson Jul 03 '21 at 01:26
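The points made in these comments can be checked with a small sketch. It assumes the question's table rebuilt inline; `describe_group` is a hypothetical helper written only for this illustration. It shows that each group handed over by `groupby` is itself a DataFrame, and that `fit` returns the `KMeans` object, which has no `groupby` method (hence the `AttributeError` in the question).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Sample data copied from the question (stands in for example.csv)
df = pd.DataFrame({
    "URBAN AREA": [1, 2, 3, 1, 2, 1, 2, 3],
    "PROVINCE": ["TRUJILLO"] * 3 + ["LIMA"] * 2 + ["LAMBAYEQUE"] * 3,
    "DENSITY": [0.30, 0.03, 0.80, 1.20, 0.04, 0.90, 0.10, 0.08],
})

def describe_group(group):
    # `def` defines the function; `return` hands a value back to the caller
    return f"{type(group).__name__} with {len(group)} rows"

# Iterating over a groupby yields (key, sub-DataFrame) pairs, one per province
summaries = {
    province: describe_group(group)
    for province, group in df.groupby("PROVINCE")
}
print(summaries)

# fit() returns the estimator itself, not a DataFrame
model = KMeans(n_clusters=2, n_init=10, random_state=0)
fitted = model.fit(df[["DENSITY"]])
print(fitted is model)             # True
print(hasattr(fitted, "groupby"))  # False
```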