
I have a large dataset of 45,421 rows × 12 columns, all of which are categorical variables; there are no numerical variables. I would like to use this dataset to build an unsupervised clustering model, but before modeling I would like to know the best feature-selection approach for it. I am also unable to plot an elbow curve for this dataset: when I run the k-means elbow method over the range k = 1-1000, it produces no clear optimal-cluster plot and takes 8-10 hours to execute. If anyone can suggest a better solution to this issue, it would be a great help.

Code:

import pandas as pd

data = {'UserName': ['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'],
        'UserClass': ['high', 'low', 'low', 'medium'],  # one value per row
        'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
        'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC'],
        'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
        'UserAccesstype': ['Region', 'country', 'country', 'region']}

df = pd.DataFrame(data)
Praveen
  • Can you give an example of a few rows of your dataset? And are you using scikit-learn for K-means? – sjc Dec 12 '19 at 18:12
    Yes, I am using scikit-learn for K-means. These are some rows of my dataset: data = {'UserName': ['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'], 'UserClass': ['high', 'low', 'low', 'medium'], 'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'], 'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC'], 'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'], 'UserAccesstype': ['Region', 'country', 'country', 'region']}; df = pd.DataFrame(data) – Praveen Dec 12 '19 at 19:14
  • The use of k-means on a strictly categorical dataset is not the best approach, because the float values calculated by the k-means algorithm have no real meaning. I suggest you use MCA and then cluster, as in this [article](https://medium.com/@varun331/using-multiple-correspondence-analysis-and-clustering-98d25ee70c28). Another alternative for unsupervised clustering of categorical variables is k-modes. The author of k-modes [explains](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.134.83&rep=rep1&type=pdf) the problems of k-means for categorical values in more detail. – Felipe Miranda Jun 04 '22 at 16:26

2 Answers


For categorical data like this, K-means is not an appropriate clustering algorithm. You may want to look for a K-modes method, which unfortunately is not currently included in the scikit-learn package. Have a look at this kmodes package available on GitHub: https://github.com/nicodv/kmodes which follows much of the syntax you're used to from scikit-learn.

For more, please see the discussion here: https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data
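The linked package exposes a scikit-learn-style `KModes` estimator (`fit_predict`, a `cost_` attribute after fitting, etc.). To illustrate the underlying idea without any dependencies, here is a minimal k-modes sketch: assign each row to the mode with the fewest attribute mismatches, then recompute each mode as the most frequent value per column. The naive first-k-rows initialization and the tiny example rows are assumptions for demonstration only, not the package's actual implementation:

```python
def mismatches(row, mode):
    # simple matching dissimilarity: number of attributes that differ
    return sum(a != b for a, b in zip(row, mode))

def k_modes(rows, k, n_iter=10):
    # naive initialization (assumption for the demo): first k rows as modes
    modes = [list(rows[i]) for i in range(k)]
    for _ in range(n_iter):
        # assignment step: each row joins its nearest mode
        clusters = [[] for _ in range(k)]
        for row in rows:
            j = min(range(k), key=lambda m: mismatches(row, modes[m]))
            clusters[j].append(row)
        # update step: new mode = most frequent value in each column
        for j, members in enumerate(clusters):
            if members:
                modes[j] = [max(set(col), key=col.count)
                            for col in zip(*members)]
    labels = [min(range(k), key=lambda m: mismatches(row, modes[m]))
              for row in rows]
    return modes, labels

# hypothetical categorical rows (class, region)
rows = [('high', 'EMEA'), ('low', 'APAC'), ('high', 'EMEA'), ('low', 'APAC')]
modes, labels = k_modes(rows, k=2)  # labels -> [0, 1, 0, 1]
```

The package version replaces the naive initialization with smarter schemes (e.g. Huang's frequency-based initialization), but the assignment/update loop is the same idea.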

sjc

To be able to run K-means or any other such model, you first need to transform the categorical variables into numerical ones.

Example using OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data={'UserAccesstype': ['Region', 'country', 'country', 'region'],
 'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
 'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
 'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC']}

df = pd.DataFrame(data)

  UserAccesstype    UserCountry UserOrganization UserRegion
0         Region  unitedkingdom         INFBLRPR       EMEA
1        country   unitedstates         INFBLRHC       EMEA
2        country      australia         INFBLRPR       APAC
3         region          india         INFBLRHC       APAC

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df.values)

X_for_Kmeans = enc.transform(df.values).toarray()

X_for_Kmeans
array([[1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0.]])

Use X_for_Kmeans for the K-means fitting. Cheers

seralouk
    Just because you can do this doesn't mean that you should. There's no clearly defined metric to define a distance between data points in the categorical space, and this is an active field of research (See here, for example: https://link.springer.com/article/10.1007/s12652-019-01445-5) – sjc Dec 12 '19 at 22:38