
Can fuzzy c-means be applied to non-numerical data sets, i.e. categorical data or a mix of numerical and categorical features? If yes (I hope so):

  • How do we calculate the cluster centers?

If not, what is the alternative? How can such data be clustered in a fuzzy way?

I need an answer, please help.

NOTE: I've used the Jaccard coefficient to calculate the distance between two points, but I still don't see how to calculate the cluster centers (see the attachments). [image: Jaccard coefficient]

Brian Tompsett - 汤莱恩
AWRAM

1 Answer


You'll have to transform your data into a numeric form. There are various ways of doing that, two of them being:

  • use vectors of feature counts (common in, e.g., text categorization)
  • use a one-hot representation, where a categorical feature that can take on n distinct values is represented as a string of n bits, with only the i-th bit set when the feature takes the i-th value in its allowed range.
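
To make the one-hot transformation concrete, here is a minimal sketch in plain Python. The helper name `one_hot` and the `colors` column are hypothetical, chosen only for illustration; they are not part of any library or of the original question.

```python
def one_hot(values, categories):
    """Encode each value as a list of 0/1 with a single 1 at its category index."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # set only the bit that matches this value
        encoded.append(row)
    return encoded

# Hypothetical categorical column with n = 3 allowed values.
colors = ["red", "green", "blue", "green"]
X = one_hot(colors, categories=["red", "green", "blue"])

for row in X:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

For mixed data, one option is to append the (suitably scaled) numerical features to each encoded row, so that every point becomes a single numeric vector.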

Both are very common transformations that many machine learning programs perform under the hood. You might also want to experiment with a metric other than the Euclidean one: especially with a one-hot representation, and depending on the data, the L1 norm (Manhattan/city block distance) may be more appropriate.
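
As a rough illustration of how the two metrics differ on one-hot vectors, the sketch below compares them on small hand-made inputs. The function names `l1_distance` and `l2_distance` and the example vectors are assumptions made for this example, not something prescribed by fuzzy c-means.

```python
import math

def l1_distance(x, y):
    """Manhattan / city block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_distance(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two one-hot points that differ in their category.
p = [1, 0, 0, 0]
q = [0, 1, 0, 0]

# A real-valued candidate cluster center (hypothetical values).
center = [1 / 3, 1 / 3, 0, 1 / 3]

print(l1_distance(p, q))       # 2: counts both differing bits
print(l2_distance(p, q))       # square root of 2
print(l1_distance(p, center))  # distance from a point to a fractional center
print(l2_distance(p, center))
```

Both functions accept fractional centers as well as 0/1 points, which is what the clustering update needs. Which metric suits a given data set better is an empirical question.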

Apart from that, just apply the given formulas to your transformed dataset.

Fred Foo
  • Thank you for your answer. Could you please check the updated question? – AWRAM Oct 09 '11 at 17:17
  • @AWRAM: I don't think the Jaccard coefficient gives rise to a mean in the general case, so you'll want to switch to either a numeric representation or the [*k*-medoids](http://en.wikipedia.org/wiki/K-medoids) algorithm – Fred Foo Oct 10 '11 at 09:24
  • Suppose we transform the features to a binary representation, e.g. I have 3 points in a cluster A, each with a membership value for this cluster: p1 (1000, membership 0.5), p2 (0100, 0.7), p3 (0001, 0.4). How do we calculate the mean in this case? – AWRAM Oct 11 '11 at 22:49
  • @AWRAM: features 1, 2 and 4 occur once in your set of three while feature 3 doesn't occur, so the unweighted mean is [1/3, 1/3, 0, 1/3]. The weighted case follows from this in the usual fashion. – Fred Foo Oct 12 '11 at 08:59
  • What would the cluster center vj mentioned above be? – AWRAM Oct 12 '11 at 18:45
  • @AWRAM: it's explained by the formula, isn't it? If you don't understand how to read such a formula, you'd be better off posting a new question at math.stackexchange.com. – Fred Foo Oct 12 '11 at 19:55
  • Thank you for the clarification :). I know how to read it, but when xi is a binary string, is it meaningful to multiply it by the membership value, which is a float between 0 and 1, and would that give a valid cluster center? – AWRAM Oct 12 '11 at 21:09
  • Yes, it will give you the mean of the cluster. Try doing some dot products of binary vectors with real vectors, on paper, in MATLAB, NumPy, whatever -- you'll see that they make sense. – Fred Foo Oct 12 '11 at 21:22
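
The three-point example from the comments can be checked with a short sketch. The formula in the question's image was not preserved, so this assumes the standard fuzzy c-means center update, v_j = Σ u_ij^m x_i / Σ u_ij^m, with a fuzzifier m that the thread does not specify. The function name `cluster_center` is hypothetical.

```python
def cluster_center(points, memberships, m=2.0):
    """Membership-weighted mean: sum(u_i**m * x_i) / sum(u_i**m).

    Assumes the standard fuzzy c-means update; m is the fuzzifier.
    """
    weights = [u ** m for u in memberships]
    total = sum(weights)
    dims = len(points[0])
    return [
        sum(w * p[d] for w, p in zip(weights, points)) / total
        for d in range(dims)
    ]

# p1, p2, p3 from the comments, as one-hot vectors with their memberships.
points = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
memberships = [0.5, 0.7, 0.4]

# Equal weights reproduce the unweighted mean [1/3, 1/3, 0, 1/3].
unweighted = cluster_center(points, [1.0, 1.0, 1.0])

# m = 1 is a plain weighted mean; m = 2 is a commonly used fuzzifier.
center_m1 = cluster_center(points, memberships, m=1.0)
center_m2 = cluster_center(points, memberships, m=2.0)

print(unweighted)
print(center_m1)
print(center_m2)
```

With m = 1 the weights are 0.5, 0.7 and 0.4 (sum 1.6), so the center is [0.3125, 0.4375, 0, 0.25] up to floating-point rounding. The center is a real-valued vector rather than a binary string: each component can be read as the weighted share of cluster members that have that feature set.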