I need to crawl Wikipedia to collect the HTML pages of countries, and the crawl itself works. Now I need to cluster the pages with k-means, and I am using Weka for that.

I used this code to convert my directory of pages to ARFF format: https://weka.wikispaces.com/file/view/TextDirectoryToArff.java Here is its output: [screenshot]

Then I opened that file in Weka and applied the StringToWordVector filter with these parameters: [screenshot]

Then I ran k-means. The output I get is:

    === Run information ===

    Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
    Relation:     text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
    Instances:    28
    Attributes:   1040
    [list of attributes omitted]
    Test mode:evaluate on training data

=== Model and evaluation on training set ===

kMeans

Number of iterations: 2

Within cluster sum of squared errors: 1915.0448503841326

Missing values globally replaced with mean/mode

Cluster centroids:

                                                              Cluster#
Attribute                                            Full Data          0          1
                                                          (28)       (22)        (6)
====================================================================================
.
.
.
.
.
bolsheviks                                              0.3652     0.3044     0.5878
book                                                    0.3229     0.3051     0.3883
border                                                  0.4329     0.5509          0
border-left-style                                       0.4329     0.5509          0
border-left-width                                       0.3375     0.4295          0
border-spacing                                          0.3124     0.3304     0.2461
border-width                                            0.5128     0.2785      1.372
boundary                                                 0.309     0.3007     0.3392
brazil                                                   0.381     0.3744     0.4048
british                                                 0.4387     0.2232     1.2288
brown                                                   0.2645     0.2945     0.1545
cache-control=max-age=87840                             0.4913     0.4866     0.5083
california                                              0.5383     0.5085     0.6478
called                                                  0.4853     0.6177          0
camp                                                    0.4591     0.5451     0.1437
canada                                                  0.3176     0.3358      0.251
canadian                                                0.2976     0.1691     0.7688
capable                                                 0.2475      0.315          0
capita                                                   0.388     0.1188      1.375
carbon                                                  0.3889      0.445     0.1834
caribbean                                               0.4275     0.5441          0
carlsbad                                                 0.548     0.5339     0.5998
caspian                                                 0.4737     0.5345     0.2507
category                                                0.2216     0.2821          0
censorship                                              0.2225     0.0761     0.7596
center                                                  0.4829     0.4074     0.7598
central                                                  0.211     0.0805     0.6898
century                                                 0.2645     0.2041     0.4862
chad                                                    0.3636     0.0979     1.3382
challenger                                              0.5008     0.6374          0
championship                                            0.6834     0.8697          0
championships                                           0.2891     0.1171     0.9197
characteristics                                          0.237          0     1.1062
charon                                                  0.5643     0.4745     0.8934
china                                                  
.
.
.
.
.


Time taken to build model (full training data) : 0.05 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      22 ( 79%)
1       6 ( 21%)

How can I check which document (DocId) ended up in which cluster? I have searched a lot but didn't find anything.
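To make clear what I am after, here is a minimal plain-Java sketch (no Weka) of k-means that returns one cluster index per input row, in input order, so each index can be matched back to a document ID. The file names, the tiny two-dimensional vectors, and the choice of the first k rows as initial centroids are all assumptions for illustration; Weka's `SimpleKMeans` picks its starting centroids differently. As far as I know, `SimpleKMeans` offers something comparable through `setPreserveInstancesOrder(true)` followed by `getAssignments()`, but I have not confirmed that this is the intended way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClusterAssignments {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid nearest to the given vector. */
    static int nearest(double[] vector, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(vector, centroids[c]) < squaredDistance(vector, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Plain k-means. Returns assignments[i] = cluster index of vectors[i],
     * in input order. Initial centroids are simply the first k vectors
     * (an assumption made here to keep the sketch deterministic).
     */
    static int[] kMeans(double[][] vectors, int k, int maxIterations) {
        int dims = vectors[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = vectors[c].clone();
        }
        int[] assignments = new int[vectors.length];
        Arrays.fill(assignments, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every vector to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < vectors.length; i++) {
                int c = nearest(vectors[i], centroids);
                if (c != assignments[i]) {
                    assignments[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no vector switched cluster
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < vectors.length; i++) {
                counts[assignments[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignments[i]][d] += vectors[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old centroid for an empty cluster
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignments;
    }

    /** Groups document IDs by the cluster index assigned to the same row. */
    static Map<Integer, List<String>> groupByCluster(String[] docIds, int[] assignments) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (int i = 0; i < docIds.length; i++) {
            clusters.computeIfAbsent(assignments[i], key -> new ArrayList<>()).add(docIds[i]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Hypothetical document IDs and feature vectors, one row per document.
        String[] docIds = {"doc_a.html", "doc_b.html", "doc_c.html", "doc_d.html"};
        double[][] vectors = {{0.0, 0.1}, {5.0, 5.1}, {0.2, 0.0}, {5.2, 4.9}};

        int[] assignments = kMeans(vectors, 2, 100);
        groupByCluster(docIds, assignments)
                .forEach((cluster, ids) -> System.out.println("Cluster " + cluster + ": " + ids));
    }
}
```

The point of the sketch is that the assignment array stays aligned with the input rows, so a parallel array of file names is enough to list the members of each cluster.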

Also, is there any other good Java library for k-means and agglomerative clustering?
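By agglomerative clustering I mean the bottom-up scheme where every document starts in its own cluster and the two closest clusters are merged until k remain. A minimal plain-Java sketch of that idea, assuming single linkage and Euclidean distance (both are my choices for illustration, and the class name, file names, and vectors are hypothetical), would look like this:

```java
import java.util.ArrayList;
import java.util.List;

class SingleLinkage {

    /** Euclidean distance between two equal-length vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Single-linkage distance: the smallest pairwise distance between two clusters. */
    static double linkage(List<Integer> c1, List<Integer> c2, double[][] vectors) {
        double best = Double.POSITIVE_INFINITY;
        for (int i : c1) {
            for (int j : c2) {
                best = Math.min(best, distance(vectors[i], vectors[j]));
            }
        }
        return best;
    }

    /**
     * Starts with one cluster per row and merges the two closest clusters
     * until only k remain. Each cluster is a list of row indices.
     */
    static List<List<Integer>> cluster(double[][] vectors, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > k) {
            int bestA = 0;
            int bestB = 1;
            double bestDist = Double.POSITIVE_INFINITY;
            // Find the closest pair of clusters (a < b).
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), vectors);
                    if (d < bestDist) {
                        bestDist = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // Merge cluster b into cluster a; b > a, so index a stays valid.
            List<Integer> merged = clusters.remove(bestB);
            clusters.get(bestA).addAll(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Hypothetical document IDs and feature vectors, one row per document.
        String[] docIds = {"doc_a.html", "doc_b.html", "doc_c.html", "doc_d.html"};
        double[][] vectors = {{0.0, 0.0}, {0.0, 1.0}, {10.0, 10.0}, {10.0, 11.0}};

        List<List<Integer>> clusters = cluster(vectors, 2);
        for (int c = 0; c < clusters.size(); c++) {
            List<String> ids = new ArrayList<>();
            for (int row : clusters.get(c)) {
                ids.add(docIds[row]);
            }
            System.out.println("Cluster " + c + ": " + ids);
        }
    }
}
```

This naive version recomputes every linkage on each merge, which is fine for my 28 documents but would be slow on a large collection; a library implementation would presumably cache the distance matrix.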

Possible duplicate of [Weka simple K-means clustering assignments](http://stackoverflow.com/questions/6685961/weka-simple-k-means-clustering-assignments) – SJB Dec 07 '15 at 13:15
