Automatically learning clusters

Question

Hi, complete newbie question here: I have a table consisting of two columns. The first column contains "bins" that are coded by where the fruit flies live. The second column is either 0 or 1: neutral vs. really likes sugar, respectively. I have two questions.

1) I suspect that there is a single variable, something about where they live, that determines how much they like sugar. Is there a way to have the computer group the bins into just 2 clusters: all the bins that like sugar vs. the neutral ones? That way we can do further experiments to determine what it is about the bins.

2) Is there a way to automatically determine how many clusters there might be driving this behavior? For example, maybe there are 4 variables (4 clusters) that determine the outcome of sugar preference.

Apologies if this is trivial. The table is listed below. Thanks!

1) The question is not clear. If you're asking whether you can classify a fly as liking sugar or not given its bin, then the answer is yes. 2) Again not clear: do you want to find clusters of bins based on the "likes sugar" column? Then again the answer is yes. Could you be a little clearer with your questions? — mp85, Feb 22 '14 at 04:43
@mp85 Sorry for the bad wording. For 1), what I want is two clusters, one representing neutral and one representing likes sugar, each containing a list of bins. If I dummy-code all the bins that fall into neutral as 1 and all the bins in likes sugar as 0, then a regression should give me the strongest possible prediction. For 2), it would be nice if the computer could tell me what the optimal clusters are and which bins fall into each. A similar regression on those clusters should then give the strongest relationship. — Ahdee, Feb 22 '14 at 05:02
Don't think of cluster analysis as "learning" some variable. Then you are doing classification, not structure discovery. For cluster analysis, think about defining *structure*. — Has QUIT--Anony-Mousse, Feb 22 '14 at 15:20

mp85 · Answer 1 · 2014-02-22T14:09:21.693

Okay, assuming I understood what you meant, one approach to problem 1) is to use Bayes' theorem. Say event L is "a fly likes sugar" and event B is "a fly is in bin B".

So what you have is:

total number of flies = 84
size of each bin (e.g. size of bin 1 = 4)

probability that a fly likes sugar:

P(L) = flies that like sugar / total number of flies = 43/84

probability that a fly doesn't like sugar:

P(notL) = 1 - P(L) = 41/84

probability that a fly is in a given bin:

P(B) = size of the bin / sum of the sizes of all bins = 4/84 (for bin 1)

probability that a fly isn't in a given bin:

P(notB) = 1 - P(B) = 80/84 (for bin 1)

probability that a fly likes sugar, knowing that it's in bin B:

P(L|B) = flies that like sugar in the bin / size of the bin
(e.g. for bin 1: 2/4 = 1/2)

probability that a fly likes sugar, knowing that it's not in bin B:

P(L|notB) = (total flies that like sugar - flies that like sugar in the bin) / (total number of flies - size of the bin) = (43 - 2)/(84 - 4) = 41/80 (for bin 1)

You want to know the probability that a fly is in a given bin B knowing that it likes sugar, which you can obtain with:

P(B|L) = (P(L|B) * P(B)) / (P(L|B) * P(B) + P(L|notB) * P(notB))
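As a quick sanity check, the bin 1 numbers quoted above can be plugged into this formula. This is only a sketch using the counts given in this answer (84 flies, 43 that like sugar, bin 1 with 4 flies of which 2 like sugar); the variable names are made up for illustration. Exact fractions are used so the result can be compared by hand.

```python
from fractions import Fraction

# Counts quoted in this answer, with bin 1 as the example bin.
total_flies = 84
total_like = 43
bin_size = 4
bin_like = 2

p_b = Fraction(bin_size, total_flies)        # P(B)
p_not_b = 1 - p_b                            # P(notB)
p_l_given_b = Fraction(bin_like, bin_size)   # P(L|B)
p_l_given_not_b = Fraction(total_like - bin_like,
                           total_flies - bin_size)  # P(L|notB)

# Denominator of Bayes' theorem (law of total probability); equals P(L).
evidence = p_l_given_b * p_b + p_l_given_not_b * p_not_b
p_b_given_l = p_l_given_b * p_b / evidence

print(evidence)      # 43/84
print(p_b_given_l)   # 2/43
```

The denominator comes out as 43/84, which is just P(L), and P(B|L) = 2/43 is simply the share of sugar-liking flies that sit in bin 1.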

If you compute P(B|L) and P(B|notL) for each bin, you will know which bins are most likely to contain flies that like sugar, and you can then study those bins further.

Hope I was clear. My statistics is a bit rusty and I'm not even sure I'm doing everything correctly, so take it as a hint to point you in the right direction.
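A rough sketch of how that per-bin computation might look in Python is below. The function name `bin_posteriors` and the rows are hypothetical (the rows are made-up illustration data, not the table from the question), and it assumes the data contains at least one fly that likes sugar and at least one neutral fly, otherwise the divisions fail.

```python
from collections import Counter
from fractions import Fraction


def bin_posteriors(rows):
    """rows: iterable of (bin_label, likes_sugar) pairs, likes_sugar in {0, 1}.

    Hypothetical helper: applies Bayes' theorem once per bin.
    Assumes both outcomes (0 and 1) occur at least once.
    """
    rows = list(rows)
    size = Counter(b for b, _ in rows)                  # flies per bin
    likes = Counter(b for b, liked in rows if liked)    # sugar-likers per bin
    total = len(rows)
    p_l = Fraction(sum(likes.values()), total)          # P(L)
    p_not_l = 1 - p_l                                   # P(notL)

    out = {}
    for b in size:
        p_b = Fraction(size[b], total)                  # P(B)
        p_l_b = Fraction(likes[b], size[b])             # P(L|B)
        out[b] = {
            "P(L|B)": p_l_b,
            "P(B|L)": p_l_b * p_b / p_l,                # Bayes' theorem
            "P(B|notL)": (1 - p_l_b) * p_b / p_not_l,   # same, for notL
        }
    return out


# Made-up example data: three bins, eight flies.
rows = [("A", 1), ("A", 1), ("A", 0),
        ("B", 0), ("B", 0), ("B", 1),
        ("C", 1), ("C", 1)]
result = bin_posteriors(rows)

# Rank bins by how likely a fly in that bin is to like sugar.
ranked = sorted(result, key=lambda b: result[b]["P(L|B)"], reverse=True)
print(ranked)
```

Note that P(B|L) is weighted by bin size, so a large bin can score high on it even when only a modest fraction of its flies like sugar. That is why the ranking here uses P(L|B) and keeps the other two values for comparison.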

You can refer here to get more accurate reasoning and results.

As for problem 2)... I have to think about it a bit more.

Thanks, this is a clever approach to use. I think it'll work well for me. — Ahdee, Feb 22 '14 at 17:18
