
I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. Both of them output a binary string of some sort plus an integer for each unique token. For example:

0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2

What do the binary string and the integer mean?

From the first link, the binary string is known as a bit string; see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/

But how do I tell from the output that dog, mouse, and cat form one cluster, while the and chased are not in the same cluster?

alvas
  • in the first link you present, it says that each line is: <cluster represented as bit string> <word> <number of occurrences> – carla Jan 08 '14 at 15:00
  • what does it even mean? cluster represented as bit string? – alvas Jan 08 '14 at 15:01
  • Can you give some details about what exactly you want to classify? In this case I could try to look for some references. Otherwise, there might not be any general procedure and I suppose it's more about expert knowledge and/or predefined measures. – Łukasz Kidziński Jan 16 '14 at 10:53
  • I need to extract semantically related clusters out of an unannotated corpus. – alvas Jan 17 '14 at 04:06
  • Sure, that's the idea of clustering, but those hierarchical algorithms just give you a hierarchy. In the example you gave, it is not clear whether dog, mouse, and cat should be in one cluster or not. It just depends on the requested level of granularity. – Łukasz Kidziński Jan 18 '14 at 09:10
  • Sometimes it's clear, sometimes it is not, so if you give more details about your dataset, we can try to work it out. – Łukasz Kidziński Jan 18 '14 at 09:10
  • @ŁukaszKidziński, of course the checkmark goes to you, you gave the best explanation of that cryptic output. Thank you for the explanation instead =) – alvas Jan 18 '14 at 12:06

5 Answers


If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. In the case of these bit strings, you just take the first L characters.

For example, cutting at the second character gives you two clusters (the, whose bit string is just 0, stays in a cluster of its own):

10           chased     

11           dog        
11           mouse      
11           cat        

At the third character you get:

110           dog        

111           mouse      
111           cat        

The cutting strategy is a different subject though.
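
A minimal sketch of that truncation in Python, assuming the three whitespace-separated columns shown in the question (bit string, word, count) live in a file; the file name and the helper are illustrative, not part of either implementation:

    from collections import defaultdict

    def clusters_at_depth(paths_file, depth):
        """Group words by the first `depth` characters of their bit strings."""
        clusters = defaultdict(list)
        with open(paths_file) as f:
            for line in f:
                bits, word, _count = line.split()
                clusters[bits[:depth]].append(word)
        return dict(clusters)

    # On the question's data, clusters_at_depth("paths", 2) gives:
    # {'0': ['the'], '10': ['chased'], '11': ['dog', 'mouse', 'cat']}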

Łukasz Kidziński
  • do you have any links/tutorials on the `cutting` strategy? – alvas Jan 09 '14 at 10:39
  • Sometimes you have some expert knowledge that there are just `K` clusters and you cut as soon as you get them. Otherwise you define some measure, [wikipedia article](http://en.wikipedia.org/wiki/Hierarchical_clustering) is a good place to start. – Łukasz Kidziński Jan 09 '14 at 12:12

In Percy Liang's implementation (https://github.com/percyliang/brown-cluster), the --c parameter allows you to specify the number of word clusters. The output contains all the words in the corpus, together with a bit string identifying the cluster and the word frequency, in the following format: <bit string> <word> <word frequency>. The number of distinct bit strings in the output equals the number of desired clusters, and words with the same bit string belong to the same cluster.
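
As a small illustration (not from the repository itself), grouping the output by its full bit string recovers those clusters; the file name is hypothetical, and the columns are assumed to be tab-separated:

    from collections import defaultdict

    clusters = defaultdict(list)
    with open("paths") as f:  # hypothetical name for the output file
        for line in f:
            bits, word, freq = line.rstrip("\n").split("\t")
            clusters[bits].append(word)

    print(len(clusters))  # should equal the --c value you passed
    for bits, words in sorted(clusters.items()):
        print(bits, words)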

Paul Baltescu

Change your command to: ./wcluster --text input.txt --c 3

--c number

This number sets the number of clusters; the default is 50. You can't distinguish the different clusters of words because the example input has only three sentences; change the 50 clusters to 3 and you can tell the difference.

I entered three tweets as input and set the cluster parameter to 3.


Hao Lyu

The integers are counts of how many times the word is seen in the input. (I have tested this in the Python implementation.)

From the comments at the top of the Python implementation:

Instead of using a window (e.g., as in Brown et al., sec. 4), this code computes PMI using the probability that two randomly selected clusters from the same document will be c1 and c2. Also, since the total numbers of cluster tokens and pairs are constant across pairs, this code uses counts instead of probabilities.
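
A rough sketch of that count-based scoring, with documents represented as lists of cluster ids; the toy data and the helper below are illustrative, not the implementation's own code:

    import math
    from collections import Counter
    from itertools import combinations

    # toy documents, each a list of cluster ids
    docs = [["c1", "c2", "c1"], ["c2", "c3"], ["c1", "c3", "c3"]]

    singles = Counter()  # occurrences of each cluster
    pairs = Counter()    # co-occurrences of cluster pairs within a document
    for doc in docs:
        singles.update(doc)
        for a, b in combinations(doc, 2):
            pairs[frozenset((a, b))] += 1

    def pmi(c1, c2):
        # log p(c1, c2) / (p(c1) * p(c2)); the normalizing totals are
        # constant across pairs, so raw counts rank pairs the same way
        return math.log(pairs[frozenset((c1, c2))] / (singles[c1] * singles[c2]))

    print(pmi("c1", "c2"))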

From the code in the Python implementation, we can see that it writes out the word, its bit string, and its count:

def save_clusters(self, output_path):
    # one tab-separated line per word: word, bit string, occurrence count
    with open(output_path, 'w') as f:
        for w in self.words:
            f.write("{}\t{}\t{}\n".format(w, self.get_bitstring(w),
                                          self.word_counts[w]))
Jason Lv

My guess is:

According to Figure 2 in Brown et al. (1992), the clustering is hierarchical, and to get from the root to each word "leaf" you make a sequence of up/down decisions. If up is 0 and down is 1, each word can be represented as a bit string.

From https://github.com/mheilman/tan-clustering/blob/master/class_lm_cluster.py :

# the 0/1 bit to add when walking up the hierarchy
# from a word to the top-level cluster
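
Following that idea, every prefix of a word's bit string names one of the nested clusters containing it; here is a small sketch using the question's example (the dictionary below is illustrative):

    # bit strings from the question's example output
    paths = {"the": "0", "chased": "10", "dog": "110",
             "mouse": "1110", "cat": "1111"}

    def ancestor_clusters(word):
        """All clusters containing `word`, from coarsest to finest."""
        bits = paths[word]
        return [bits[:i] for i in range(1, len(bits) + 1)]

    print(ancestor_clusters("mouse"))  # ['1', '11', '111', '1110']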
cyborg
  • yep, but that doesn't give me clusters, it only gives me similarity, right? – alvas Jan 08 '14 at 15:15
  • The set of clusters that the word is included in is equivalent to the set of bit-string prefixes. So the word with bit string 1110 is included in clusters 1, 11, and 111. – cyborg Jan 08 '14 at 15:22