2

Recently I watched a lot of Stanford's hilarious Open Classroom video lectures. The part about unsupervised machine learning particularly got my attention. Unfortunately, it stops where it might get even more interesting.

Basically, I am looking to classify discrete matrices with an unsupervised algorithm. The matrices contain only discrete values from the same range. Let's say I have thousands of 20x15 matrices with values ranging from 1 to 3. I have just started reading through the literature, and I feel that image classification is far more complex (color histograms) and that my case is rather a simplification of what is done there.

I also looked at the Machine Learning and Cluster CRAN Task Views, but I do not know where to start with a practical example.

So my question is: which package / algorithm would be a good pick to start playing around and working on the problem in R?

EDIT: I realized that I might have been too imprecise: my matrices contain discrete choice data, so mean-based clustering might(!) not be the right idea. I understand what you said about vectors and observations, but I am hoping for some function that accepts matrices or data.frames, because I have several observations over time.

EDIT2: I realize that a package, function, or introduction that focuses on unsupervised classification of categorical data is what would help me the most right now.

Amro
Matt Bannert
  • 1
    `kmeans` (in the base `stats` package) and `hclust` - these are the two basic ones. – hatmatrix Oct 27 '11 at 22:44
  • does that work for categorical data too ? – Matt Bannert Oct 28 '11 at 08:42
  • 1
    @ran2: any clustering algorithm works for categorical data with the right settings. Using a 1-of-K coding is a Good Idea and if your clustering package offers multiple distance metrics, you might want to try L1 distance instead of Euclidean. – Fred Foo Oct 28 '11 at 08:58
  • @larsmans, thx! Do you have a good read / starting point on what you just explained? – Matt Bannert Jul 03 '12 at 23:11
  • @ran2: most of my ML knowledge is from practice and discussion with colleagues. I bet you can find a lot of info in [ESL](http://www-stat.stanford.edu/~tibs/ElemStatLearn/), though. – Fred Foo Jul 04 '12 at 09:30
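The 1-of-K coding with an L1 distance suggested in the comments above could look roughly like this in R. This is only a sketch on made-up toy data: the names `raw`, `df`, and `one_hot`, the sizes, and the choice of average linkage are assumptions for illustration, not something the commenters specified.

```r
# Hypothetical toy data: 10 observations of 4 categorical variables in 1..3
set.seed(1)
raw <- matrix(sample(1:3, 10 * 4, replace = TRUE), nrow = 10)
df  <- as.data.frame(lapply(as.data.frame(raw), factor, levels = 1:3))

# 1-of-K (one-hot) coding: every factor becomes one 0/1 column per level
one_hot <- do.call(cbind, lapply(df, function(f) {
  outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
}))

# L1 (Manhattan) distance on the 0/1 columns; for one-hot data this equals
# twice the number of variables on which two observations disagree
d  <- dist(one_hot, method = "manhattan")
hc <- hclust(d, method = "average")  # average linkage is an arbitrary choice
```

Calling `plot(hc)` draws the dendrogram, and `cutree(hc, k = ...)` returns cluster labels for a chosen number of clusters.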

3 Answers

0

... classify discrete matrices by an unsupervised algorithm

You must mean cluster them. Classification is commonly done by supervised algorithms.

I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there

Without knowing what your matrices represent, it's hard to tell what kind of algorithm you need. But a starting point might be to flatten your 20x15 matrices into length-300 vectors; each element of such a vector would then be a feature (or variable) to base the clustering on. This is how most ML packages work, including the cluster package you link to: "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable."
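A minimal sketch of that flattening step, assuming the matrices are held in an R list (the list name `mats` and the toy data are made up for illustration):

```r
# Hypothetical toy data: 50 matrices of size 20 x 15 with values in 1..3
set.seed(1)
mats <- replicate(50,
                  matrix(sample(1:3, 20 * 15, replace = TRUE), nrow = 20),
                  simplify = FALSE)

# as.vector() unrolls each matrix column by column; after transposing,
# each row is one observation and each column one cell of the original matrix
X <- t(sapply(mats, as.vector))

# Treat the cells as categorical rather than numeric
X_cat <- as.data.frame(lapply(as.data.frame(X), factor, levels = 1:3))
```

Here `X` has 50 rows and 300 columns, and `X_cat` holds the same data as factors, which is the form that clustering functions for categorical data usually expect.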

Fred Foo
  • Sorry for being too imprecise. Edited my post now. – Matt Bannert Oct 27 '11 at 21:34
  • Historically, what's now generally known as clustering used to be called classification. There's even a fairly well-known textbook by that name: http://www.amazon.com/Classification-Chapman-Monographs-Statistics-Probability/dp/1584880139 – Hong Ooi Oct 28 '11 at 00:09
0

So far I have found `daisy` from the cluster package, specifically its "gower" metric, which refers to Gower's similarity coefficient and can handle multiple modes of data. Gower seems to be a fairly old distance metric, but it's what I found for use with categorical data.
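A small sketch of how this might be used, assuming the data has already been reshaped into a data frame of factors with one row per observation. The toy data, the names `raw` and `df`, and `k = 3` are assumptions for illustration only.

```r
library(cluster)

# Hypothetical toy data: 30 observations of 6 categorical variables in 1..3
set.seed(1)
raw <- matrix(sample(1:3, 30 * 6, replace = TRUE), nrow = 30)
df  <- as.data.frame(lapply(as.data.frame(raw), factor, levels = 1:3))

# Gower dissimilarity; for purely nominal columns this is the share of
# variables on which two observations differ
d <- daisy(df, metric = "gower")

# The dissimilarity object can be passed to hierarchical clustering ...
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)   # k = 3 is an arbitrary choice

# ... or to partitioning around medoids
pm <- pam(d, k = 3)
```

`groups` and `pm$clustering` each hold one cluster label per observation; the two methods will not necessarily agree.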

Matt Bannert
-1

You might want to start from here : http://cran.r-project.org/web/views/MachineLearning.html

iinception
  • Who +1'd this? I don't downvote because y'all are just trying to help me to get a grip on this. I mean I just posted that link in my original post. If it was meant to say RTFM, writing exactly that would have been honest. – Matt Bannert Oct 28 '11 at 09:48
  • 1
    i didn't notice CRAN ML site referenced in your original post....! – iinception Oct 28 '11 at 21:44
  • Don't worry :) I am not blaming you, just wondered about the random up. In the meantime I got quite some information, but it was relatively (compared to other R issues) hard to find a starting point. there's really sooo much around – especially if you do not know very exactly what you are looking for. Really found some packages and will hopefully learn enough to summarize it here later. – Matt Bannert Oct 28 '11 at 22:49