-2

Suppose we want to rank the importance of each feature in a dataset for each cluster produced by a clustering task. What characteristics of a feature should we measure to decide whether it is good at characterizing a given cluster?

I am looking for a more analytical characterization of these features. For example, if a feature f has a high standard deviation over the whole dataset but a small standard deviation within a cluster c, does this mean that the feature is important for distinguishing cluster c?
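
To make the quantity in the example concrete, here is a minimal sketch of the ratio between the within-cluster and whole-dataset standard deviation of one feature. The function name `std_ratio` and the toy values are made up for illustration; the sketch only computes the quantity described above and does not by itself settle whether the feature is important.

```python
from statistics import pstdev


def std_ratio(values, labels, cluster):
    """Within-cluster std of a feature divided by its std over all samples.

    `values` holds one feature's value per sample and `labels` the cluster
    assigned to each sample (hypothetical names, not from any library).
    """
    inside = [v for v, lab in zip(values, labels) if lab == cluster]
    return pstdev(inside) / pstdev(values)


# Toy data (made up): six samples, two clusters.
labels = [0, 0, 0, 1, 1, 1]
tight_feature = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]   # concentrated inside cluster 0
spread_feature = [1.0, 11.0, 6.0, 2.0, 12.0, 7.0]   # as spread inside as overall

tight_ratio = std_ratio(tight_feature, labels, cluster=0)
spread_ratio = std_ratio(spread_feature, labels, cluster=0)
print(f"tight: {tight_ratio:.3f}, spread: {spread_ratio:.3f}")
```

A ratio well below 1 corresponds to the situation in the example (large overall spread, small spread within c); a ratio near 1 means the feature varies about as much inside the cluster as it does overall.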

Zaratruta

1 Answer

1

There are two approaches you could use here:

  • A feature selection approach: remove the feature, redo the clustering, and see whether the result changes substantially. If it does not, you can say the feature is unnecessary for the clustering task. The downside of this approach is the time it takes to rerun the clustering for each subset of features in the dataset.
  • A statistical approach: split the data into two groups, the samples in the cluster and the rest of the samples, then ask how different the feature values are between the two populations. Depending on the distribution of the feature, you could use a KS test, a t-test, a chi-squared test, or any other test for comparing the distributions of two samples.
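
The statistical approach could be sketched roughly as follows. This is an illustrative sketch using only the Python standard library, not a definitive implementation: the helper names (`ks_statistic`, `welch_t`, `rank_features`) and the toy data are assumptions made for the example, and it reports test statistics only, without p-values. In practice a library routine such as SciPy's `scipy.stats.ks_2samp` would also give p-values.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def welch_t(a, b):
    """Welch's t statistic (two samples, variances not assumed equal)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_features(rows, labels, cluster):
    """Rank feature indices by KS statistic between `cluster` and the rest."""
    inside = [r for r, lab in zip(rows, labels) if lab == cluster]
    outside = [r for r, lab in zip(rows, labels) if lab != cluster]
    scores = []
    for j in range(len(rows[0])):
        d = ks_statistic([r[j] for r in inside], [r[j] for r in outside])
        scores.append((j, d))
    # Largest statistic first: most different between cluster and rest.
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy data (made up): feature 0 separates cluster 0, feature 1 does not.
rows = [(0.1, 5.0), (0.2, 1.0), (0.3, 3.0), (5.1, 5.0), (5.2, 1.0), (5.3, 3.0)]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(rows, labels, cluster=0)
print(ranking)
```

Ranking by the KS statistic is just one choice; `welch_t` could be swapped in when a difference in means is what matters, and a chi-squared test would be the analogue for categorical features.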
David Taub
  • OK. This would be a kind of data-driven way of measuring the relevance of a feature, but I was thinking of a more analytical way. That is, what statistical properties should a feature have to be important for characterizing a given cluster? – Zaratruta Dec 26 '17 at 16:25