4

I have categorical attributes that contain string values. Three of them contain the day name (Mon to Sun), the month name and the time interval (morning, afternoon, evening); two others, as I mentioned before, hold district and street names. These are followed by gender, role, comments (a predefined fixed field with values such as good, bad, strongly agree, etc.), surname and first name. My intention is to cluster them and visualize the result. I applied k-means clustering using WEKA, but it did not work. Now I wish to apply hierarchical clustering. I found this code:

import numpy as np
import scipy.cluster.hierarchy as sch
X = np.random.randn(100, 2)  # 100 2-dimensional observations
d = sch.distance.pdist(X)   # vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')

However, X in the above code is numeric; I have categorical data. Is there some way that I can use an array of categorical data to find the distance? In other words, can I use categorical data of string values to find the distance? I would then use that distance in sch.linkage(d, method='complete').

Nhqazi
  • How do you plan to define the distance between strings -- or is that part of your question? – Prune May 31 '17 at 23:08
  • It is the question. My understanding is that the method of distance calculation can be defined in sch.distance.pdist. I intend to use the cosine function, though I am not sure it is the right way to find the distance, so my first problem is how to define the variable X in the above code for categorical variables. – Nhqazi Jun 01 '17 at 09:03
  • The basic question, I guess, is how to represent categorical variables that have multiple values. I understand the k-modes method is used for categorical variables, but I intend to use hierarchical clustering. – Nhqazi Jun 01 '17 at 14:15

2 Answers

2

I think we've identified the problem, then: you leave the X values as they are, string data. You can pass those to pdist, but you also have to supply a 2-arity function (2 inputs, numeric output) for the distance metric.

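The simplest one would be that equal classifications have 0 distance; everything else is 1. Because pdist passes each row to the function as an array, the function has to reduce the comparison to a single number; averaging the mismatches gives 0 or 1 for a single attribute and the fraction of differing attributes for several. You can do this with

d = sch.distance.pdist(X, lambda u, v: (u != v).mean())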

If you have other class discrimination in mind, just code logic to return the desired distance, wrap it in a function, and then pass the function name to pdist. We can't help with that, because you've told us nothing about your classes or the model semantics.
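As a rough illustration of "wrap it in a function", here is a minimal sketch. Everything in it is an assumption made for the example: the sample records, the helper names day_distance and record_distance, the cyclic treatment of day names, and the equal weighting of attributes. It also sidesteps the question of how pdist handles string arrays by passing row indices and looking the records up inside the callable.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical records: (day name, time interval, district).
records = [
    ("mon", "morning", "north"),
    ("tue", "morning", "north"),
    ("sun", "evening", "south"),
    ("thu", "afternoon", "south"),
]

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def day_distance(a, b):
    """Cyclic distance between two day names, scaled to [0, 1]."""
    diff = abs(DAYS.index(a) - DAYS.index(b))
    return min(diff, len(DAYS) - diff) / (len(DAYS) // 2)

def record_distance(r, s):
    """Average of per-attribute distances (assumed equal weighting)."""
    parts = [
        day_distance(r[0], s[0]),  # ordered, cyclic attribute
        float(r[1] != s[1]),       # plain match/mismatch
        float(r[2] != s[2]),       # plain match/mismatch
    ]
    return sum(parts) / len(parts)

# pdist only sees row indices; the callable looks up the real records.
idx = np.arange(len(records)).reshape(-1, 1)
d = pdist(idx, lambda i, j: record_distance(records[int(i[0])], records[int(j[0])]))
```

The resulting d is a condensed distance vector, so it can be handed to sch.linkage in the same way as in the question. Whether a cyclic day distance or equal weights make sense depends on what the clusters are supposed to mean.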

Does that get you moving?

Prune
  • Thanks. My data has more than 10 attributes, and two of them contain the names of districts and streets in a city. Each of them could have many distinct values, maybe over 20. I am not sure the above technique is applicable to this kind of categorical value. Please advise. – Nhqazi Jun 01 '17 at 19:41
  • I advise you to *specify* the problem. "I am not sure ..." is not a specification. I've answered your given question: how to represent the categorical data (just as you're already doing) and how to handle a distance function. I gave you a trivial sample. There is no way I can handle a follow-up question when you continue to avoid any productive comment on what you *do* need for a distance metric. – Prune Jun 01 '17 at 19:51
  • OK, here is my specification. I have categorical attributes that contain string values. Three of them contain the day name (Mon to Sun), the month name and the time interval (morning, afternoon, evening); two others, as I mentioned before, hold district and street names. These are followed by gender, role, comments (a predefined fixed field with values such as good, bad, strongly agree, etc.), surname and first name. My intention is to cluster them and visualize the result. I applied k-means clustering using WEKA, but it did not work. I hope I have specified the problem now. – Nhqazi Jun 01 '17 at 22:32
  • Please edit this into the question's main body, for all to see. Also discuss what you imagine as a useful distance function. For instance, is the distance Sun-to-Tue twice as far as Sun-to-Mon, or the same? – Prune Jun 01 '17 at 22:35
0

Another possibility is the use of the Hamming distance.

Y = pdist(X, 'hamming')

Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean.

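If each categorical value is a single code, e.g. "m"/"f", this could be what you are looking for. Note that the built-in 'hamming' metric expects numeric or boolean input, so string categories need to be mapped to integer codes first.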
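One way to apply the Hamming metric to string categories is sketched here, under assumptions: the sample rows are made up, and the per-column integer encoding via np.unique is just one possible mapping. The codes act purely as labels, since Hamming distance only checks whether two values are equal.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical data: one row per record, one column per categorical attribute.
raw = np.array([
    ["mon", "morning", "m"],
    ["mon", "morning", "f"],
    ["sun", "evening", "f"],
    ["sun", "evening", "m"],
    ["mon", "morning", "m"],
])

# Map each column's strings to integer codes; the codes are only labels.
codes = np.column_stack(
    [np.unique(col, return_inverse=True)[1] for col in raw.T]
)

d = pdist(codes, "hamming")            # fraction of attributes that differ
L = sch.linkage(d, method="complete")  # same linkage call as in the question
labels = sch.fcluster(L, 0.5, criterion="distance")
```

Here the cut at 0.5 groups records that differ in fewer than half of their attributes. With many-valued attributes such as street names, every mismatch still counts the same, which may or may not be the behaviour you want.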

https://en.wikipedia.org/wiki/Hamming_distance

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist