I would like to use R to perform a hierarchical clustering of data that looks like this:
   L1 L2 L3
W1 p  pr r
W2 p  NA r
which is supposed to mean that L2 shares feature W1 with both L1 and L3, while feature W2 is present in L1 and L3, but with a different value in each, and is missing from L2.

(Edit: the L's are languages, the W's are stems of words in these languages, and the values (p, r, etc.) describe how these words can be derived in the specific language. I believe that a word being derived in the same way in different languages might suggest a common origin. When a value is missing, the interpretation is not clear: the absence might be meaningful, or my sources might be incomplete, but I guess I'll have to cautiously assume the first option. The eventual goal is to classify languages by which stems are present in them and how those stems behave. End of edit.)
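To make the intended reading of the table concrete, here is a minimal sketch of the toy data in R. It assumes that each letter stands for one way of deriving the word, so that "pr" means "both p and r"; that reading, and the helper names `values` and `shares`, are only illustrative and not part of any package.

```r
# Toy data from above: rows are stems (W), columns are languages (L).
x <- matrix(c("p", "pr", "r",
              "p", NA,   "r"),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("W1", "W2"), c("L1", "L2", "L3")))

# Assumption: each letter is one derivation type, so "pr" is the set {p, r}.
# A missing cell (NA) is treated as an empty set of values.
values <- function(cell) {
  if (is.na(cell)) character(0) else strsplit(cell, "")[[1]]
}

# Hypothetical helper: TRUE if languages a and b share at least one
# value for the given stem.
shares <- function(stem, a, b) {
  length(intersect(values(x[stem, a]), values(x[stem, b]))) > 0
}

shares("W1", "L2", "L1")  # TRUE:  L2 has p in common with L1
shares("W1", "L2", "L3")  # TRUE:  L2 has r in common with L3
shares("W2", "L1", "L3")  # FALSE: p vs r
shares("W2", "L1", "L2")  # FALSE: W2 is missing from L2
```

Here a missing value simply counts as "nothing shared", which matches the cautious assumption that absence is meaningful.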
Can you please explain how I can transform this data so that I can perform the classification on it, and advise on which similarity index I should use?