An understandable clusterization

Question

I have a dataset. Each element of this set consists of numerical and categorical variables. The categorical variables are both nominal and ordinal. There is some natural structure in this dataset. Commonly, experts cluster datasets such as mine using their 'expert knowledge', but I want to automate this clustering process.

Most clustering algorithms use a distance (Euclidean, Mahalanobis and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types; for example, there is no obvious distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables assuming they are normally distributed.

So it became clear to me that algorithms like K-means will not produce good results.

At the moment I am working with the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I ran into another obstacle: the clustering results are really hard to interpret, if not impossible. What I wanted to get is something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like decision trees for classification.

So, my question is:

Is there any existing clustering algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of the clusters?

Additional info:

As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (it is an ultimate goal of the whole project), because the field of study is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I get an understandable description of the clusters.

Decision Tree

As was suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. But unfortunately the interpretation of these rules is almost as hard as with the raw clustering output. First of all, only the first few levels of rules from the root node make any sense: the closer to the leaves, the less sense they make. Secondly, these rules don't match any expert knowledge.

So I came to the conclusion that clustering is a black box, and it is not worth trying to interpret its results.

Also

I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: instead of calculating the intra-group variance, calculate a category utility function and use it as the split criterion. As a result we should get a decision tree whose leaves are the clusters, with cluster descriptions out of the box. But I haven't tried this, and I am not sure about accuracy and everything else.
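To make the category utility criterion mentioned above concrete, here is a minimal sketch for nominal attributes only. It assumes the usual COBWEB-style definition, CU = (1/K) * sum_k P(C_k) * [sum_ij P(A_i = v_ij | C_k)^2 - sum_ij P(A_i = v_ij)^2]; the function names and the toy rows are hypothetical, and numeric attributes (which need a variance-based term) are not handled.

```python
from collections import Counter

def sum_sq_probs(rows):
    """Sum over attributes and values of P(attribute = value)^2 within rows."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate attribute by attribute
        total += sum((count / n) ** 2 for count in Counter(column).values())
    return total

def category_utility(clusters):
    """Category utility of a partition given as a list of lists of tuples."""
    all_rows = [row for cluster in clusters for row in cluster]
    n = len(all_rows)
    baseline = sum_sq_probs(all_rows)
    gain = sum(len(c) / n * (sum_sq_probs(c) - baseline) for c in clusters if c)
    return gain / len(clusters)

# Hypothetical toy data: (material, size)
rows = [("glass", "small"), ("glass", "small"),
        ("steel", "large"), ("steel", "large")]

# Candidate split on material versus an arbitrary split that mixes materials.
by_material = [[r for r in rows if r[0] == "glass"],
               [r for r in rows if r[0] == "steel"]]
mixed = [[rows[0], rows[2]], [rows[1], rows[3]]]

cu_material = category_utility(by_material)
cu_mixed = category_utility(mixed)
```

A tree builder along the lines described would evaluate each candidate split this way and keep the one with the highest category utility; here the split on material scores higher than the mixed one.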

Why can't you use decision trees where class=cluster? I assume you already have some labeled examples you can use... — amit, Aug 28 '12 at 08:06
@amit that's the point: I don't have labeled examples, and I don't have any existing classes. Ideally I want to achieve the following: input dataset -> clustering algorithm -> cluster descriptions, and when an expert looks at the description, he says: "Yes, that's it. I understand it, and I would do the same". — Artem Pianykh, Aug 28 '12 at 08:26
Do you know the number of categories or the members of the categories a priori? Without a distance metric there is really no good way to determine how good your algorithm is. — argentage, Aug 31 '12 at 19:18
Sorry, I cannot give you many more hints than those below. I believe training a decision tree is the best idea, as decision trees are one of the *few* methods around that really give you an explanation of their decision process. — Has QUIT--Anony-Mousse, Sep 03 '12 at 12:15
Related to Anony-Mousse's suggestion, [random forests](http://www.stat.berkeley.edu/~breiman/RandomForests/) can be run in an "unsupervised" mode that leads to a clustering algorithm, and since it's tree based it side-steps the metric definition problem. — joran, Sep 05 '12 at 13:23
Well, decision trees "side step" the metric problem by always looking at single attributes only. If you have correlations in the data, this might not be what you want. P.S. I added an interesting reference on explaining clusters to my reply below. — Has QUIT--Anony-Mousse, Sep 06 '12 at 06:53

Has QUIT--Anony-Mousse · Accepted Answer · 2012-09-06T06:51:28.607

For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. it does not have to satisfy the triangle inequality).

K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.

However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions; for example, you could use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.

Once you have a decent similarity function, a whole bunch of algorithms will become available to you, e.g. DBSCAN or OPTICS. ELKI may be of interest to you; it has a tutorial on writing custom distance functions.
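As an illustration of composing a distance from parts, here is a minimal sketch. It uses a plain weighted sum of a scaled Euclidean part and a categorical mismatch fraction rather than the harmonic mean mentioned above, because a weighted sum is the simplest combination to reason about. The function name, the `scales` dictionary and the weight `w_numeric` are assumptions made for the example; ordinal variables could be mapped to ranks and treated as numeric.

```python
import math

def mixed_distance(a, b, numeric_keys, categorical_keys, scales, w_numeric=0.5):
    """Weighted blend of a scaled Euclidean part and a categorical mismatch part.

    a, b             -- records as dicts
    scales           -- per-feature scaling factor for numeric features (assumed given)
    w_numeric        -- weight of the numeric part; the rest goes to the categorical part
    """
    numeric = 0.0
    if numeric_keys:
        squared = sum(((a[k] - b[k]) / scales[k]) ** 2 for k in numeric_keys)
        numeric = math.sqrt(squared / len(numeric_keys))
    categorical = 0.0
    if categorical_keys:
        mismatches = sum(a[k] != b[k] for k in categorical_keys)
        categorical = mismatches / len(categorical_keys)
    return w_numeric * numeric + (1.0 - w_numeric) * categorical

# Hypothetical records
a = {"thickness": 30.0, "material": "glass"}
b = {"thickness": 60.0, "material": "steel"}
c = {"thickness": 30.0, "material": "steel"}
scales = {"thickness": 30.0}

d_ab = mixed_distance(a, b, ["thickness"], ["material"], scales)
d_ac = mixed_distance(a, c, ["thickness"], ["material"], scales)
d_aa = mixed_distance(a, a, ["thickness"], ["material"], scales)
```

The weights and scales are where the domain knowledge goes: they decide how much a material mismatch counts relative to a difference in thickness.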

Interpretation is a separate matter. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course you could then train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering, because the one really nice feature of decision trees is that they are somewhat human-understandable. Just as a Support Vector Machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, this approach works with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.

There was a related publication last year. It is a bit obscure and experimental (presented at a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in the form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to say, for example, "this cluster is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, so you need to investigate it more closely; maybe the algorithm discovered something new here". But it was very experimental (workshops are for work-in-progress research). You might be able to use this by simply using your features as ground truth. It should then detect whether a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
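The post-processing step can be sketched without any library as follows. This is a deliberately small greedy Gini tree grown until every leaf is pure, not a full decision-tree learner; the function names, the toy rows and the cluster labels are hypothetical, and each feature is assumed to be either all numeric or all categorical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of cluster labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_tests(rows, feature):
    """Yield (description, predicate) pairs for one feature."""
    values = sorted({row[feature] for row in rows})
    if all(isinstance(v, (int, float)) for v in values):
        for lo, hi in zip(values, values[1:]):  # midpoints between distinct values
            t = (lo + hi) / 2
            yield f"{feature} <= {t:g}", (lambda row, t=t: row[feature] <= t)
    else:
        for v in values:
            yield f"{feature} == {v!r}", (lambda row, v=v: row[feature] == v)

def extract_rules(rows, labels, features, path=()):
    """Return (conditions, cluster) rules from a greedy tree grown to purity."""
    if len(set(labels)) == 1:
        return [(path, labels[0])]
    best = None
    for feature in features:
        for text, test in candidate_tests(rows, feature):
            mask = [test(row) for row in rows]
            left = [l for l, m in zip(labels, mask) if m]
            right = [l for l, m in zip(labels, mask) if not m]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, text, mask)
    if best is None:  # identical rows in different clusters: use the majority
        return [(path, Counter(labels).most_common(1)[0][0])]
    _, text, mask = best
    rules = []
    for keep, condition in ((True, text), (False, f"not ({text})")):
        sub_rows = [r for r, m in zip(rows, mask) if m == keep]
        sub_labels = [l for l, m in zip(labels, mask) if m == keep]
        rules += extract_rules(sub_rows, sub_labels, features, path + (condition,))
    return rules

# Hypothetical data; labels stand in for the output of any clustering algorithm.
rows = [{"material": "glass", "thickness": 30},
        {"material": "glass", "thickness": 45},
        {"material": "steel", "thickness": 50},
        {"material": "steel", "thickness": 90}]
labels = [0, 0, 1, 1]

rules = extract_rules(rows, labels, ["material", "thickness"])
for conditions, cluster in rules:
    print(" and ".join(conditions), "-> cluster", cluster)
```

Because the tree only ever looks at the cluster labels, the same function can be reused to compare the descriptions produced by several clustering algorithms on the same data.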

H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011

You can do the bottom suggestion with COBWEB, too, I guess. Cluster, then learn a decision tree to get a description for the clusters. — Has QUIT--Anony-Mousse, Aug 30 '12 at 11:12
If you make a decision tree from the clustering algorithm, make sure you set the minimum number of instances per leaf to 1 and disable pruning; this will ensure the resulting tree is consistent with the clustering algorithm. — amit, Aug 30 '12 at 12:32
Thanks for your reply! I will surely try the latter approach when I have enough time, and will update answer on any results. — Artem Pianykh, Sep 06 '12 at 10:28

user1149913 · Answer 2 · 2014-02-01T17:28:01.500

A common approach to solve this type of clustering problem is to define a statistical model that captures relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian Mixture Model) then finding the mixture component with the highest probability for a particular data point.

In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.

I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):

[Figure: plot of the example dataset]

There are a number of frameworks available for running Monte Carlo inference on statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:

data {
  int<lower=0> N;                    // number of data points
  int<lower=1> C;                    // number of categories

  array[N] real x;                   // normally distributed component data
  array[N] int<lower=1, upper=C> y;  // categorical component data
}
parameters {
  real<lower=0, upper=1> theta;      // mixture probability
  array[2] real mu;                  // means for the normal component
  array[2] simplex[C] phi;           // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta;
  real log_one_minus_theta;
  array[2] vector[C] log_phi;
  vector[C] alpha;                   // symmetric Dirichlet concentration

  log_theta = log(theta);
  log_one_minus_theta = log(1.0 - theta);

  for (c in 1:C)
    alpha[c] = 0.5;

  for (k in 1:2)
    for (c in 1:C)
      log_phi[k, c] = log(phi[k, c]);
}
model {
  theta ~ uniform(0, 1); // equivalently, ~ beta(1, 1)
  for (k in 1:2) {
    mu[k] ~ normal(0, 10);
    phi[k] ~ dirichlet(alpha);
  }

  // marginalize over the two mixture components for each data point
  for (n in 1:N) {
    target += log_sum_exp(log_theta + normal_lpdf(x[n] | mu[1], 1) + log_phi[1, y[n]],
                          log_one_minus_theta + normal_lpdf(x[n] | mu[2], 1) + log_phi[2, y[n]]);
  }
}

I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with higher probability to recover the cluster assignments below:

[Figure: recovered cluster assignments]

Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
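The assignment step described above (compute the probability of each data point under each mixture component, then pick the larger) can be sketched as follows. The parameter values are hypothetical stand-ins for a posterior sample, not results from the model run; the unit standard deviation matches the Stan model, and categories are 0-based here whereas Stan indexes them from 1.

```python
import math

def normal_logpdf(x, mu, sigma=1.0):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def responsibilities(x, y, theta, mu, phi):
    """Posterior probability of each of the two components for one data point.

    x -- real-valued component, y -- 0-based category index
    theta -- mixture probability of component 0
    mu -- two component means, phi -- two categorical distributions
    """
    log_w = [
        math.log(theta) + normal_logpdf(x, mu[0]) + math.log(phi[0][y]),
        math.log(1.0 - theta) + normal_logpdf(x, mu[1]) + math.log(phi[1][y]),
    ]
    m = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def assign(x, y, theta, mu, phi):
    """Index of the component with the higher posterior probability."""
    probs = responsibilities(x, y, theta, mu, phi)
    return probs.index(max(probs))

# Hypothetical parameter values standing in for one posterior sample.
theta = 0.5
mu = [-2.0, 2.0]
phi = [[0.70, 0.10, 0.10, 0.05, 0.05],
       [0.05, 0.05, 0.10, 0.10, 0.70]]

cluster_a = assign(-1.8, 0, theta, mu, phi)
cluster_b = assign(2.1, 4, theta, mu, phi)
probs = responsibilities(0.0, 2, theta, mu, phi)
```

The fitted `mu` and `phi` double as the cluster description: each component is summarized by its mean and by the categories it favours.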

The main problem here is not clustering the data set, but finding an acceptable interpretation of the resulting clusters. Of course, a different clustering algorithm can drastically change the cluster structure, and suddenly this structure may become plain and simple. But that is too good to be true. I haven't used BUGS or Stan, but if I understand it right, these use a Gaussian Mixture Model, so the result is similar to what the EM algorithm produces. Is that so? — Artem Pianykh, Sep 06 '12 at 09:29
You can extract an interpretation of each cluster from the distribution of each mixture component. In the GMM, each mixture component is a (multivariate) Gaussian and can be described by a mean vector and covariance matrix. This description is probably not ideal for data with categorical components. So you probably want to define custom distributions over mixture components to do this, NOT a Gaussian mixture. BUGS and Stan can automate this inference process for custom mixtures so you don't have to derive EM or sampling updates yourself. — user1149913, Sep 06 '12 at 12:03

score 2 · Answer 3 · answered Sep 03 '12 at 15:55

For heterogeneous, non-Euclidean data vectors such as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.
