How to compute histograms using weka

Question

Given a dataset with 23 points spread out over 6 dimensions, in the first part of this exercise we should do the following, and I am stuck on the second half of this:

Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use three equal intervals per dimension in the domain 0..100, and consider a cell as dense if it contains at least five objects.

Now this is trivial and simply a matter of counting. The next part asks the following though:
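For reference, the manual counting can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: it counts one-dimensional cells (one interval of one dimension), the names `interval_index` and `dense_units_1d` are made up for illustration, and the sample points are hypothetical since the actual dataset is not shown here.

```python
from collections import Counter


def interval_index(value, low=0.0, high=100.0, bins=3):
    """Map a value in [low, high] to an equal-width interval index 0..bins-1."""
    width = (high - low) / bins
    idx = int((value - low) / width)
    # Clamp so that value == high falls into the last interval.
    return min(max(idx, 0), bins - 1)


def dense_units_1d(points, min_count=5, low=0.0, high=100.0, bins=3):
    """Return {(dimension, interval): count} for each 1-D cell with >= min_count points."""
    counts = Counter()
    for point in points:
        for dim, value in enumerate(point):
            counts[(dim, interval_index(value, low, high, bins))] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Hypothetical 2-D sample; the real exercise has 23 points in 6 dimensions.
points = [(10, 80), (20, 90), (5, 70), (30, 95), (15, 50), (60, 85), (90, 10)]
dense = dense_units_1d(points)
print(dense)
```

With this sample, five points fall into the first interval of dimension 0 and five into the last interval of dimension 1, so those two cells are reported as dense.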

Identify a way to compute the above CLIQUE result by only using the functions of Weka provided in the tabs Preprocess, Classify, Cluster, or Associate. Hint: Just two tabs are needed.

I've been trying this for over an hour now, but I can't seem to get anywhere near a solution. If anyone has a hint, or a useful tutorial that gives me a little more insight into Weka, it would be very much appreciated!

This is just a guess, but I have a hunch "Cluster" is one of those tabs. ...Why isn't there a "homework.stackexchange.com" yet, anyway? — JAB, Jun 05 '12 at 20:17
Thank you for the reply. I thought so myself, but the available cluster algorithms in the Cluster tab are fairly limited, so I pretty much tried them all with some parameters. Sadly I was not able to get to the right values. I figure I need the Preprocess tab to select the different values and maybe do some normalization or something similar, and then either the Cluster or Classify tab to get to the cells. Sadly the number of possible combinations is huge here... :( — Pandoro, Jun 05 '12 at 20:31

score 2 · Accepted Answer · answered Jun 09 '12 at 09:52

I am assuming you have 23 instances (rows) and 6 attributes (dimensions).

Use three equal intervals per dimension

Use the Preprocess tab to discretize your data into 3 equal bins, one bin per interval. See the image or the command line. You can also switch useEqualFrequency between false (equal-width bins) and true (equal-frequency bins) and try again. I think true may give better results.

weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last

unsupervised.attribute.Discretize
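One thing to watch for with equal-width binning: as far as I know, the filter derives the bin boundaries from the observed minimum and maximum of each attribute, not from a declared domain such as 0..100. The sketch below only illustrates that difference in plain Python; it does not call Weka, the helper name `equal_width_cut_points` is made up, and the values are hypothetical.

```python
def equal_width_cut_points(low, high, bins=3):
    """Return the bins-1 interior cut points of an equal-width split of [low, high]."""
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]


# Hypothetical attribute values that do not reach 0 or 100.
values = [12, 25, 40, 58, 71]

# Cut points derived from the observed range of the data.
data_cuts = equal_width_cut_points(min(values), max(values))

# Cut points derived from the domain stated in the exercise.
domain_cuts = equal_width_cut_points(0, 100)

print(data_cuts)
print(domain_cuts)
```

If the data does not touch both ends of the domain, the two sets of cut points differ, so the resulting bins will not line up with the intervals the exercise asks for.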

After that, cluster your data. This will show you which instances are near each other. Since you would like to find dense cells, I think SOM (self-organizing map) may be appropriate.

a cell as dense if it contains at least five objects.

You have 23 instances, so try 2x2=4 cluster centers first, then 2x3=6, 2x4=8, and 3x3=9. If your data points are close together, some of the cluster centers should always hold at least 5 instances no matter how many cluster centers you choose.

Thank you for your reply! I wasn't able to fully use it since I couldn't find the SOM clustering algorithm and I do not believe we can add new ones for this exercise :( However, you put me on the right track! The problem was that when using Discretize, the bins were not nicely divided into 0-33, 33-66, 66-100. So first I edited the min and max points in the data, then I used Discretize, and then just DBSCAN to get the dense cells. This is not pretty, nor scientific, but it's the only way I found to inform Discretize of the actual min and max of the data. Thanks a lot! — Pandoro, Jun 11 '12 at 08:58
You are using Developer Version 3.7.2+. You need to use Package Manager to install SOM algorithm. — Atilla Ozgur, Jun 11 '12 at 09:05
