Data pre-processing for input data when clustering with CLUTO

Question

I am trying to cluster some words based on their pairwise similarities. Part of my data is shown below (this is just an example, "animal.txt"; it is similar to an adjacency matrix).

        cat dog horse ostrich
cat       5   4     3       2
dog       4   5     1       2
horse     3   1     5       4
ostrich   2   2     4       5

A bigger number means that the similarity between two words is higher. From data in this format, I want to make clusters. (For example, if I want 2 clusters, the result would be (cat, dog), (horse, ostrich).)

I tried to use CLUTO to make the clusters.

First, I have to reconstruct the input file before running the CLUTO clustering. So I used doc2mat (http://glaros.dtc.umn.edu/gkhome/files/fs/sw/cluto/doc2mat.html), but I don't know how to use it properly to make the CLUTO input files (such as the mat and label files). And once the CLUTO input files are made, how can I make clusters from the data above?

What data do you want to see in the output of the preprocessing script? — alex, Dec 27 '13 at 17:28
After pre-processing with doc2mat, I want a mat file plus column and row label files. Those are the input for CLUTO. — GoodGJ, Dec 27 '13 at 17:35

score 0 · Answer 1 · answered Jan 13 '17 at 15:13

Since your data are an adjacency matrix, the corresponding CLUTO input file is a so-called GraphFile, not a MatrixFile, and thus doc2mat doesn't help.

This program txt2graph.pl converts a file like your example "animal.txt" to a Graph File and a Row Label File:

#!/usr/bin/perl
@F = split ' ', <>;             # begin reading txt file, read column headers
($GraphFile = $ARGV) =~ s/(\.txt)?$/.graph/;    # animal.txt -> animal.graph
$LabelFile = $GraphFile.".rlabel";
open LABEL, ">$LabelFile" or die "$LabelFile: $!\n";
open GRAPH, ">$GraphFile" or die "$GraphFile: $!\n";
print GRAPH $#F+1, "\n";        # output number of vertices=objects=columns=rows
while (<>)
{                               # process each object row
    @F = split ' ', $_, 2;      # split into name, numbers
    print LABEL shift @F, "\n"; # output name
    print GRAPH @F;             # output numbers
}
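
To make the expected file layout concrete, here is a minimal sketch of the same conversion done in memory on the example table. The helper name `table_to_graph` is hypothetical (it is not part of CLUTO or of the script above), and the sketch assumes the dense GraphFile layout that the script above writes: the number of vertices on the first line, then one row of similarities per object.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the text of a labelled similarity table into
# the contents of a dense GraphFile and a row label file (two strings).
sub table_to_graph {
    my ($text) = @_;
    my @lines  = grep { /\S/ } split /\n/, $text;   # drop blank lines
    my @header = split ' ', shift @lines;           # column headers
    my $graph  = scalar(@header) . "\n";            # first line: vertex count
    my $labels = '';
    for my $line (@lines) {
        my ($name, @values) = split ' ', $line;     # row name, similarities
        die "row '$name' has " . @values . " values, expected " . @header . "\n"
            unless @values == @header;
        $labels .= "$name\n";
        $graph  .= join(' ', @values) . "\n";
    }
    return ($graph, $labels);
}

# The example data from the question, with its original spacing.
my $animal = <<'END';
    cat dog horse ostrich
cat  5    4    3    2
dog  4    5    1    2
horse 3   1    5    4
ostrich 2  2   4    5
END

my ($graph, $labels) = table_to_graph($animal);
print $graph;    # vertex count, then one row of similarities per animal
print $labels;   # one animal name per line
```

Comparing this output with the generated animal.graph and animal.graph.rlabel is a quick way to confirm the input files are well formed before running scluster.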

After the CLUTO clustering is done, this program pclusters.pl prints the result in your desired output format:

#!/usr/bin/perl
($LabelFile = $ARGV[0]) =~ s/(\.clustering\.\d+)?$/.rlabel/;    # derive label file name
open LABEL, $LabelFile or die "$LabelFile: $!\n";
chomp(@label = <LABEL>); close LABEL;                           # read labels
while (<>)
{
    $cluster[$_] = [] unless $cluster[$_];      # initialize a new cluster
    push @{$cluster[$_]}, $label[$.-1];         # add label to its cluster
}
foreach $cluster (@cluster)
{
    print "(", join(', ', @$cluster), ")\n";    # print a cluster's labels
}
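
The grouping step can also be sketched on its own, without any files. The assignment vector below is hypothetical, not real CLUTO output, and `group_labels` is an illustrative name. The sketch additionally assumes that a negative cluster number marks an object that was left out of every cluster, and collects such labels separately instead of indexing the cluster array with a negative number.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group labels by cluster number. $assignments holds one cluster id per
# object, in the same order as $labels. Negative ids are collected
# separately (assumption: they mark objects left unclustered).
sub group_labels {
    my ($labels, $assignments) = @_;
    my (@clusters, @unassigned);
    for my $i (0 .. $#$assignments) {
        my $id = $assignments->[$i];
        if ($id < 0) { push @unassigned, $labels->[$i] }
        else         { push @{ $clusters[$id] }, $labels->[$i] }
    }
    return (\@clusters, \@unassigned);
}

# Hypothetical 2-way assignment for the four animals, not real CLUTO output.
my @labels      = qw(cat dog horse ostrich);
my @assignments = (0, 0, 1, 1);

my ($clusters, $unassigned) = group_labels(\@labels, \@assignments);

# Print each cluster as a parenthesised list; an empty slot prints "()".
print "(", join(', ', @{ $_ // [] }), ")\n" for @$clusters;
```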

The whole procedure is then:

> txt2graph.pl animal.txt
> scluster animal.graph 2
> pclusters.pl animal.graph.clustering.2
