Preparation of categorical data for hierarchical clustering

Question

I would like to use R to perform a hierarchical clustering of data that looks like this:

     L1   L2   L3
W1   p    pr   r
W2   p    NA   r

This is supposed to mean that L2 shares feature W1 with both L1 and L3, while feature W2 is present in L1 and L3 with different values and is missing from L2.

Edit: The L’s are languages, the W’s are stems of words in these languages, and the values (p, r, etc.) describe how these words can be derived in the specific language. I believe that a word being derived in the same way in different languages might suggest a common origin. When a stem is missing, the interpretation is not clear: the absence might be meaningful, or my sources might be incomplete, but I guess I'll have to cautiously assume the former. The eventual goal is to classify languages by which stems are present in them and how they behave.

Can you please explain how I can transform this data so that I can perform classification on it, and advise on which similarity index I should use?

Can you please be clearer about which are the objects to be clustered? — Michele, Apr 26 '13 at 12:41
@Michele I've edited the post and briefly explained what my data represent. Please let me know if a more detailed explanation is needed. — Kamil S., Apr 26 '13 at 14:25
In the first instance, could you use the same value for all three, or does `pr` mean that L2 has two values for this feature? In the latter case, an enumeration like `feature(L1,W1,p); feature(L2,W1,p); feature(L2,W1,r); feature(L3,W1,r)` might make more sense than a single matrix over W and L. — tripleee, Apr 26 '13 at 14:30
@tripleee It means that L2 has two values. Where can I find out more about how to code such an enumeration in R? I don't seem to be able to find it with Google. — Kamil S., Apr 26 '13 at 14:52
Just change the layout of your table, maybe rows of W and values indicating (language,feature) tuples for that W? Or using language as the primary key might make more sense from a usability point of view. — tripleee, Apr 26 '13 at 15:16
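
The enumeration suggested in the comments could be sketched in R roughly as follows. This is only a minimal sketch under several assumptions that are not confirmed in the thread: each character of a cell such as `pr` is taken to be a separate value, a missing stem simply contributes no features (so absence counts as a difference, in line with the cautious reading above), and Jaccard distance with average linkage is picked purely for illustration. The names `wide`, `long`, `incidence`, `d` and `tree` are made up for the example.

```r
# Wide table from the question: rows are stems (W), columns are languages (L).
wide <- data.frame(
  L1 = c("p", "p"),
  L2 = c("pr", NA),
  L3 = c("r", "r"),
  row.names = c("W1", "W2"),
  stringsAsFactors = FALSE
)

# Enumerate one (language, stem, value) row per observed value.
# Assumption: a cell like "pr" holds two one-character values, "p" and "r".
long <- do.call(rbind, lapply(names(wide), function(lang) {
  do.call(rbind, lapply(rownames(wide), function(stem) {
    cell <- wide[stem, lang]
    if (is.na(cell)) return(NULL)  # missing stem: no rows for this language
    data.frame(language = lang, stem = stem,
               value = strsplit(cell, "")[[1]],
               stringsAsFactors = FALSE)
  }))
}))

# Each (stem, value) pair becomes one binary feature of a language.
long$feature <- paste(long$stem, long$value, sep = ":")
languages <- sort(unique(long$language))
features  <- sort(unique(long$feature))
incidence <- matrix(0, nrow = length(languages), ncol = length(features),
                    dimnames = list(languages, features))
incidence[cbind(long$language, long$feature)] <- 1

# Jaccard distance between languages (dist's "binary" method),
# followed by average-linkage hierarchical clustering.
d    <- dist(incidence, method = "binary")
tree <- hclust(d, method = "average")
```

Calling `plot(tree)` then draws the dendrogram. Whether Jaccard is the right index depends on how missing stems should be interpreted; if gaps in the sources are likely, a measure that ignores stems missing from either language may be more appropriate.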

0 Answers