I need to crawl Wikipedia to get the HTML pages of countries, and I have done that successfully. Now, to build clusters, I need to run k-means, and I am using Weka for that.
I used this code to convert my directory into ARFF format: https://weka.wikispaces.com/file/view/TextDirectoryToArff.java
Then I opened that file in Weka and applied the StringToWordVector filter (the parameters I used appear in the Relation line of the output below). Then I ran SimpleKMeans. The output I am getting is:
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
Relation: text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 28
Attributes: 1040
[list of attributes omitted]
Test mode: evaluate on training data
=== Model and evaluation on training set ===
kMeans
Number of iterations: 2
Within cluster sum of squared errors: 1915.0448503841326
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(28) (22) (6)
====================================================================================
...
bolsheviks 0.3652 0.3044 0.5878
book 0.3229 0.3051 0.3883
border 0.4329 0.5509 0
border-left-style 0.4329 0.5509 0
border-left-width 0.3375 0.4295 0
border-spacing 0.3124 0.3304 0.2461
border-width 0.5128 0.2785 1.372
boundary 0.309 0.3007 0.3392
brazil 0.381 0.3744 0.4048
british 0.4387 0.2232 1.2288
brown 0.2645 0.2945 0.1545
cache-control=max-age=87840 0.4913 0.4866 0.5083
california 0.5383 0.5085 0.6478
called 0.4853 0.6177 0
camp 0.4591 0.5451 0.1437
canada 0.3176 0.3358 0.251
canadian 0.2976 0.1691 0.7688
capable 0.2475 0.315 0
capita 0.388 0.1188 1.375
carbon 0.3889 0.445 0.1834
caribbean 0.4275 0.5441 0
carlsbad 0.548 0.5339 0.5998
caspian 0.4737 0.5345 0.2507
category 0.2216 0.2821 0
censorship 0.2225 0.0761 0.7596
center 0.4829 0.4074 0.7598
central 0.211 0.0805 0.6898
century 0.2645 0.2041 0.4862
chad 0.3636 0.0979 1.3382
challenger 0.5008 0.6374 0
championship 0.6834 0.8697 0
championships 0.2891 0.1171 0.9197
characteristics 0.237 0 1.1062
charon 0.5643 0.4745 0.8934
china
...
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 22 ( 79%)
1 6 ( 21%)
How can I check which DocId is in which cluster? I have searched a lot but didn't find anything.
Also, is there any other good Java library for k-means and agglomerative clustering?
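To make the first question concrete: what I am after is the per-document assignment, not just the cluster sizes. The following is only a minimal sketch in plain Java (no Weka) of the kind of result I mean. The class name, the method names, the document labels, and the two-dimensional vectors are all hypothetical and made up for illustration. The seeding (first k documents) is an assumption of the sketch and not what SimpleKMeans does.

```java
import java.util.Arrays;

// Hypothetical illustration only: none of these names come from Weka.
class ClusterAssignmentSketch {

    // Squared Euclidean distance between two equal-length vectors.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // For each document vector, returns the index of its nearest centroid.
    static int[] assign(double[][] docs, double[][] centroids) {
        int[] cluster = new int[docs.length];
        for (int d = 0; d < docs.length; d++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(docs[d], centroids[c])
                        < squaredDistance(docs[d], centroids[best])) {
                    best = c;
                }
            }
            cluster[d] = best;
        }
        return cluster;
    }

    // Plain k-means (Lloyd's algorithm). Assumption of this sketch: the
    // first k documents are used as the initial centroids.
    static int[] kMeans(double[][] docs, int k, int maxIterations) {
        int dims = docs[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = docs[c].clone();
        }
        int[] cluster = assign(docs, centroids);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Recompute each centroid as the mean of its member documents.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int d = 0; d < docs.length; d++) {
                counts[cluster[d]]++;
                for (int j = 0; j < dims; j++) {
                    sums[cluster[d]][j] += docs[d][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep its previous centroid
                }
                for (int j = 0; j < dims; j++) {
                    centroids[c][j] = sums[c][j] / counts[c];
                }
            }
            int[] next = assign(docs, centroids);
            if (Arrays.equals(next, cluster)) {
                break; // assignments stopped changing
            }
            cluster = next;
        }
        return cluster;
    }

    public static void main(String[] args) {
        // Made-up labels and vectors, standing in for DocIds and word vectors.
        String[] docIds = {"doc-0", "doc-1", "doc-2", "doc-3"};
        double[][] vectors = {{0.0, 0.1}, {5.0, 5.1}, {0.2, 0.0}, {5.2, 4.9}};
        int[] cluster = kMeans(vectors, 2, 100);
        for (int i = 0; i < docIds.length; i++) {
            System.out.println(docIds[i] + " -> cluster " + cluster[i]);
        }
    }
}
```

The point is the returned array: entry i is the cluster index of document i, so each DocId can be printed next to its cluster. I am looking for the equivalent of this for my 28 instances in Weka.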