ML / density clustering on house areas. two-component or more mixtures in each dimension

Question

I am trying to teach myself ML and came across this problem. Help from more experienced people in the field would be much appreciated!

Suppose I have three vectors containing the areas of house compartments such as the bathroom, living room and kitchen. The data covers about 70,000 houses. A histogram of each individual vector shows clear evidence of a bimodal distribution, say a two-component Gaussian mixture. I would now like some sort of ML algorithm, preferably unsupervised, that classifies houses according to these attributes, e.g. large bathroom, small kitchen, large living room.

More specifically, I would like an algorithm that chooses the best possible separation threshold for each bimodal vector, say large/small kitchen (this can be binary, since we assume there is evidence of bimodality), does the same for the other vectors, and then clusters the data. Ideally this would come with some confidence measure so that I could check houses in the intermediate regimes. For instance, a house with a clearly large kitchen, but whose bathroom falls close to the large/small threshold, would be placed near the bottom of the list of "large kitchens and large bathrooms". For this reason, first deciding on a threshold (fitting the Gaussians with the lowest possible FDR), collapsing the data and then clustering would not be desirable.

Any advice on how to proceed? I know R and Python.

Many thanks!!

score 1 · Answer 1 · answered Apr 03 '13 at 10:06

What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multivariate Gaussian mixture with shared spherical covariances as the variance tends to zero). This naturally gives you a confidence measure, related to the Euclidean distance between the point in question and the centroids.

One final note: I'm not sure about clustering each attribute in turn and then making composites from the independent attributes. Why not let the algorithm find the clusters in the multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance between the features (a big kitchen increases the probability of a big bedroom) and produce natural groupings you might not find by considering each attribute in isolation.
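
As a rough illustration of the k-means idea above (not code from the answer), here is a minimal pure-Python sketch that clusters houses in the full 3-D space and attaches a distance-based confidence to each assignment. The synthetic log10 areas, the function names `kmeans` and `assign_with_margin`, and the margin formula (one minus the ratio of the nearest to the second-nearest centroid distance) are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: group points by their nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        # Update step: move each centroid to the mean of its group.
        new = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def assign_with_margin(p, centroids):
    """Return (cluster index, margin), with margin in [0, 1].

    margin = 1 - d_nearest / d_second_nearest; a value near 0 means the
    point sits close to the boundary between two clusters.
    """
    d = sorted((math.dist(p, c), j) for j, c in enumerate(centroids))
    (d1, j1), (d2, _) = d[0], d[1]
    margin = 1.0 - d1 / d2 if d2 > 0 else 0.0
    return j1, margin


# Synthetic data (assumed): log10 areas of (bathroom, living room, kitchen).
rng = random.Random(1)


def blob(center, n):
    return [tuple(rng.gauss(m, 0.05) for m in center) for _ in range(n)]


houses = blob((0.6, 1.2, 0.9), 200) + blob((1.0, 1.6, 1.3), 200)
centroids = kmeans(houses, k=2)

_, clear_margin = assign_with_margin((0.6, 1.2, 0.9), centroids)
_, border_margin = assign_with_margin((0.8, 1.4, 1.1), centroids)
print(clear_margin, border_margin)
```

Sorting the houses within each cluster by `margin` would put the borderline cases at the bottom of each list. Note that this margin is only a heuristic, not a probability.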

Ben Allison

Thanks Ben! Trying this out at the moment; I will get back to you soon. I've been told, though, that k-means doesn't perform well in this case. Curious to see why... – HslashML Apr 08 '13 at 12:10
Just be sure that each instance is a house, and it's represented as the area of the 3 rooms you care about (or whatever features you want to define), then run it through whatever clustering algorithm you like (k-means is just an example). Because the data is so low dimensional, you should even be able to visualise it – Ben Allison Apr 08 '13 at 14:20
I did do a 3D plot of the data on a log10 scale (I'm clustering on the log10 values; I don't know if this may be a problem). I see 3-7 more or less separable density regions in the space, some denser and more localised, some more spread out. I wanted a method capable of capturing this. I tried DBSCAN but, probably due to the differences in density, it gave no good results. So far the most exciting results were with OPTICS: most of the clusters can be clearly seen in the reachability plot. – HslashML Apr 22 '13 at 13:06
That said, one of the regions/clusters, quite a big one, results in a useless smear across the ordered plot. (You can sort of see two trends in it, but it is not a clear-cut case.) I also found other methods like EM, AUTO-HDS, DiSH, HiSC and INCONCO. Does anyone have experience with any of these that they would like to share? The basic idea would be to have a statistical method that decides on boundaries for the dense blob regions I can observe and classifies every data point (no noise). Thanks! – HslashML Apr 22 '13 at 13:07

score 1 · Answer 2 · answered Apr 22 '13 at 15:36

Sounds like you want EM clustering with a mixture of Gaussians model.

This should be available in the mclust package in R.
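
To make the EM idea concrete, here is a hand-rolled sketch in Python for a single attribute (it does not use mclust, and a real analysis would more likely fit all three dimensions jointly with a library). It fits a two-component 1-D Gaussian mixture and returns the posterior probability that a value belongs to the "large" component, which is the kind of confidence measure asked for. The synthetic kitchen areas, the quartile-based initialisation and the names `fit_two_gaussians` and `prob_large` are assumptions for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fit_two_gaussians(xs, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, sigmas). Component 1 is initialised as the
    upper ("large") one; EM usually, but not provably, keeps that order.
    """
    n = len(xs)
    xs_sorted = sorted(xs)
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    # Crude initialisation (assumed): quartiles as means, pooled spread.
    means = [xs_sorted[n // 4], xs_sorted[3 * n // 4]]
    sigmas = [sd_all, sd_all]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, means[1], sigmas[1])
            resp.append(p1 / (p0 + p1))
        # M-step: re-estimate weight, mean and spread of each component.
        for k in (0, 1):
            r = [g if k == 1 else 1.0 - g for g in resp]
            nk = sum(r)
            weights[k] = nk / n
            means[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sigmas


def prob_large(x, weights, means, sigmas):
    """Posterior probability that x came from the upper component."""
    p0 = weights[0] * normal_pdf(x, means[0], sigmas[0])
    p1 = weights[1] * normal_pdf(x, means[1], sigmas[1])
    return p1 / (p0 + p1)


# Synthetic log10 kitchen areas (assumed): 600 small, 400 large.
rng = random.Random(0)
kitchens = [rng.gauss(0.9, 0.08) for _ in range(600)]
kitchens += [rng.gauss(1.3, 0.08) for _ in range(400)]

params = fit_two_gaussians(kitchens)
weights, means, sigmas = params

# Houses whose posterior is closest to 0.5 are the ambiguous ones.
ranked = sorted(kitchens, key=lambda x: abs(prob_large(x, *params) - 0.5))
most_ambiguous = ranked[0]
print(means, prob_large(1.3, *params), most_ambiguous)
```

The point where `prob_large` crosses 0.5 plays the role of the large/small threshold, but nothing is collapsed to a hard label: each house keeps its posterior, so borderline cases can be sorted to the bottom of a list.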

Has QUIT--Anony-Mousse

score 0 · Answer 3 · answered May 07 '13 at 08:24

In addition to what the others have suggested, it is indeed possible to cluster on the individual dimensions (perhaps even with density-based methods such as DBSCAN), forming one-dimensional clusters (intervals), and then to work from there, possibly combining the intervals into multi-dimensional, rectangular clusters.

I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.
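
A toy sketch of the per-dimension idea, under stated assumptions: each attribute is split into intervals wherever the gap between sorted neighbours exceeds `eps` (in one dimension this matches DBSCAN when every point counts as a core point), and the interval indices are then combined into rectangular cells. The six example houses, the shared `eps` value and the helper names are made up for illustration. Real, overlapping data would need a proper density criterion, and the border handling mentioned above is not implemented here.

```python
from collections import Counter


def intervals_1d(values, eps):
    """Split sorted values wherever neighbouring values differ by more than eps.

    Returns a list of (low, high) intervals, in increasing order.
    """
    vs = sorted(values)
    out = []
    start = prev = vs[0]
    for v in vs[1:]:
        if v - prev > eps:
            out.append((start, prev))
            start = v
        prev = v
    out.append((start, prev))
    return out


def label_of(x, intervals):
    """Index of the interval that contains x."""
    for i, (lo, hi) in enumerate(intervals):
        if lo <= x <= hi:
            return i
    raise ValueError("value outside all intervals")


# Hypothetical areas in square metres: (bathroom, kitchen, living room).
houses = [
    (4.0, 8.0, 18.0),
    (4.5, 8.5, 19.0),
    (4.2, 15.0, 20.0),
    (9.0, 16.0, 35.0),
    (9.5, 15.5, 36.0),
    (8.8, 8.2, 34.0),
]

# One set of intervals per attribute, found independently.
columns = list(zip(*houses))
per_dim = [intervals_1d(col, eps=2.0) for col in columns]

# A house's cell is the tuple of its interval indices, e.g. (0, 1, 0)
# means small bathroom, large kitchen, small living room.
cells = [tuple(label_of(x, iv) for x, iv in zip(h, per_dim)) for h in houses]
counts = Counter(cells)
print(per_dim)
print(counts)
```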
