
So you have an array

1
2
3
60
70
80
100
220
230
250

For a better understanding: (plots from the original question omitted)

How would you group/cluster the three areas into arrays in Python (v2.6), so that in this case you get three arrays containing

[1 2 3] [60 70 80 100] [220 230 250]

Background:

The y-axis is frequency, the x-axis is the number. These numbers are the ten highest amplitudes, represented by their frequencies. I want to create three discrete numbers from them for pattern recognition. There could be many more points, but all of them are grouped by a relatively big frequency difference, as you can see in this example between about 3 and about 60, and between about 100 and about 220. Note that what counts as big and what counts as small changes, but the difference between clusters remains significant compared to the difference between elements of a group/cluster.

Has QUIT--Anony-Mousse
Zurechtweiser
  • This is not specifically a Python problem. You'd first have to choose an appropriate clustering algorithm and see how you can implement it in Python (or whether it is already implemented, for instance in SciPy). – Björn Pollex Jan 20 '12 at 10:43
  • If the problem and dataset are always like this, you could use a "home-made" heuristic yourself and fine-tune it to work on your data. But if the complexity were a bit more than this, I think you cannot be spared from studying the many good suggestions and algorithms pointed out in the answers. – heltonbiker Jan 20 '12 at 12:54
  • It is not always 'like this'. Differences are: 1. more numbers. 2. different gaps between the clusters. 3. Different gaps between the elements in clusters. What remains though is that the difference between element gaps and cluster gaps is significant or in other words: Delta(elements) << Delta(cluster) – Zurechtweiser Jan 20 '12 at 14:27
  • In fact, stats.stackexchange.com would be a better place to ask, and there probably are already a couple of duplicates there. – Has QUIT--Anony-Mousse Jan 20 '12 at 19:23

5 Answers


Observe that your data points are actually one-dimensional if x just represents an index. You can cluster your points using SciPy's scipy.cluster.vq module, which implements the k-means algorithm.

>>> import numpy as np
>>> from scipy.cluster.vq import kmeans, vq
>>> y = np.array([1, 2, 3, 60, 70, 80, 100, 220, 230, 250], dtype=float)
>>> codebook, _ = kmeans(y, 3)  # three clusters
>>> cluster_indices, _ = vq(y, codebook)
>>> cluster_indices
array([1, 1, 1, 0, 0, 0, 0, 2, 2, 2])

The result means: the first three points form cluster 1 (an arbitrary label), the next four form cluster 0 and the last three form cluster 2. Grouping the original points according to the indices is left as an exercise for the reader.
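That exercise might look something like this (a plain-Python sketch; `group_by_label` is my name, not part of SciPy, and it relies on dicts preserving insertion order, guaranteed since Python 3.7):

```python
def group_by_label(values, labels):
    """Group values by cluster label, preserving first-appearance order."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return list(groups.values())

y = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
labels = [1, 1, 1, 0, 0, 0, 0, 2, 2, 2]   # as returned by vq above
print(group_by_label(y, labels))
# [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```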

For more clustering algorithms in Python, check out scikit-learn.

Fred Foo
  • I don't like the phrase "is left as an exercise for the reader" due to its arrogance. – Zurechtweiser Jan 20 '12 at 14:37
  • @RichartBremer: the phrase just indicates that I'm too busy/lazy to solve the list-processing drudgery and I trust that you can solve that yourself. It would also distract from the core of the answer. I don't see what's arrogant about it, I certainly didn't mean to be arrogant. – Fred Foo Jan 20 '12 at 14:43
  • Ok, it seems to be a simple misunderstanding. – Zurechtweiser Jan 20 '12 at 16:24
  • Try this on the data set `array=list(range(15))` if you find the result very convincing. K-means is NOT a good choice for one-dimensional data, in particular when you do not know `k`. In fact, the only good thing to say about k-means is that it is very simple to implement. – Has QUIT--Anony-Mousse Jan 20 '12 at 19:04
  • @Anony-Mousse: that's a very unfair dataset for *k*-means. It works well on the OP's sample data; I don't know how the rest of their data looks. – Fred Foo Jan 20 '12 at 19:38
  • Well, it just shows how much k-means actually *relies* on the data being well behaved, and `k` being appropriate. But let me give you another dataset: `[1,2,3,10,20,30,1000,2000,3000]`. Of course this again is unfair... – Has QUIT--Anony-Mousse Jan 20 '12 at 20:47
  • I'm not trying to make your answer appear bad. k-means doesn't make that much sense on 1-dimensional data, but my points are that a) k-means makes a lot of implicit assumptions on the data: clusters being equal in numerical size and `k` being known; b) clustering is not just about grouping objects, but actually about grouping objects in a way that makes sense for the particular task to solve, and this cannot be answered by the algorithm, but by the domain expert. – Has QUIT--Anony-Mousse Jan 20 '12 at 20:55

Here is a simple algorithm implemented in Python that checks whether or not a value is too far (in terms of standard deviations) from the mean of a cluster:

from math import sqrt

def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum(lst) / n
    stdev = sqrt((sum(x*x for x in lst) / n) - (mean * mean)) 
    return mean, stdev

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue

        mean,stdev = stat(cluster)
        if abs(mean - i) > n * stdev:    # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list

        cluster.append(i)
    yield cluster           # yield the last cluster

This will return what you expect in your example with 5 < n < 9:

>>> array = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
>>> for cluster in parse(array, 7):
...     print(cluster)
[1, 2, 3]
[60, 70, 80, 100]
[220, 230, 250]
Rik Poggi
  • array = [1, 2, 3, 4, 60, 70, 80, 100, 220, 230, 250] makes the code divide into two arrays 1->3 and 4->250. – Zurechtweiser Jan 20 '12 at 14:42
  • @RichartBremer: The problem was that I tested that in Python 3, while in Python 2 `sum(lst) / n` with `n` an integer gave an integer as a result, so `mean` was `1` instead of `1.5`. Converting `len(lst)` to `float` resolves the issue *(I edited the code)*. – Rik Poggi Jan 20 '12 at 15:08
  • This is probably the most sensible of the methods proposed so far (e.g. run kmeans on range(1,15)). However, you still should spend some thoughts on what you want to achieve. There are many methods that will produce such a split of the array; which one is appropriate depends a lot on what you are using it for and what your real data looks like. +1 for this answer, for not just using kmeans because it is clustering, but actually considering the problem. – Has QUIT--Anony-Mousse Jan 20 '12 at 19:22
  • It has been a long time since the last post, but could you adapt this code to use a dictionary inside a dictionary instead of array=[1,2,3...]? – billwild May 14 '13 at 13:22
  • @RikPoggi would you mind taking a look at my question, which uses you code here: http://stackoverflow.com/questions/18721774/python-cluster-variables-in-list-of-tuples-by-2-factors-silmutanously – Irek Sep 10 '13 at 15:32
  • I know it's been a long time; great solution, but is it possible to automate `n`? One fixed number doesn't always give correct results. For example, let's say `n=7` works in most cases, but for this array `[130, 167, 213, 441, 445, 451, 478, 515, 526, 564, 655, 782, 1261]` it doesn't group properly. `3,9,1` would be the best for me, but it gives `3,3,6,1`. – Ergec May 13 '17 at 20:15

I assume you want a pretty-good-but-simple algorithm here.

If you know you want N clusters, then you can take the differences (deltas) between consecutive members of the (sorted) input list. E.g. in numpy:

 deltas = np.diff(np.sort(values))  # with numpy imported as np; `values` is the input list

Then you can place your cutoffs where you find the N-1 biggest differences.

Things are trickier if you don't know what N is. Here you might place the cutoffs whenever you see a delta greater than a certain size. This will then be a hand-tuned parameter, which is not great, but might be good enough for you.
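Put together, the known-N case might look like this (a sketch; the function name `split_into_n` is mine, not from any library):

```python
def split_into_n(values, n_clusters):
    """Split sorted 1-D data at its (n_clusters - 1) largest gaps."""
    data = sorted(values)
    # gap positions 1..len(data)-1, ordered by gap size (ascending)
    by_gap = sorted(range(1, len(data)), key=lambda i: data[i] - data[i - 1])
    k = n_clusters - 1
    cuts = sorted(by_gap[-k:]) if k > 0 else []
    bounds = [0] + cuts + [len(data)]
    return [data[lo:hi] for lo, hi in zip(bounds, bounds[1:])]

print(split_into_n([1, 2, 3, 60, 70, 80, 100, 220, 230, 250], 3))
# [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```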

Adrian Ratnapala

You can solve this in various ways. One of the obvious ones, when you throw in the keyword "clustering", is to use k-means (see other replies).

However, you might want to first understand more closely what you are actually doing or attempting to do, instead of just throwing a random function at your data.

As far as I can tell from your question, you have a number of 1-dimensional values, and you want to separate them into an unknown number of groups, right? Well, k-means might do the trick, but in fact, you could just look for the k largest differences in your data set. I.e. for any index i > 0, compute a[i] - a[i-1], and choose the k indexes where this difference is largest. Most likely, your result will actually be better and faster than using k-means.

In python code:

k = 2
a = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
a.sort()
b=[] # A *heap* would be faster
for i in range(1, len(a)):
  b.append( (a[i]-a[i-1], i) )
b.sort()
# b now is [... (20, 6), (20, 9), (57, 3), (120, 7)]
# and the last ones are the best split points.
b = map(lambda p: p[1], b[-k:])
b.sort()
# b now is: [3, 7]
b.insert(0, 0)
b.append(len(a) + 1)
for i in range(1, len(b)):
  print a[b[i-1]:b[i]],
# Prints [1, 2, 3] [60, 70, 80, 100] [220, 230, 250]

(Incidentally, this can be seen as simple single-linkage clustering!)

A more advanced method, which actually gets rid of the parameter k, computes the mean and standard deviation of the differences b[*][0], and splits wherever a difference is larger than, say, mean + 2*stddev. Still, this is a rather crude heuristic. Another option would be to actually assume a value distribution, such as k normal distributions, and then use e.g. Levenberg-Marquardt to fit the distributions to your data.
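A sketch of that parameter-free variant (the function name is mine; I expose the multiplier as a `factor` parameter because on this very sample a fixed mean + 2*stddev threshold, about 100 here, only catches the 120 gap and merges the first two groups, which illustrates how crude the heuristic is):

```python
def split_by_gap_threshold(values, factor=2.0):
    """Split wherever a gap exceeds mean + factor * stddev of all gaps."""
    data = sorted(values)
    gaps = [b - a for a, b in zip(data, data[1:])]
    mean = sum(gaps) / float(len(gaps))
    stdev = (sum((g - mean) ** 2 for g in gaps) / len(gaps)) ** 0.5
    threshold = mean + factor * stdev
    clusters = [[data[0]]]
    for prev, cur in zip(data, data[1:]):
        if cur - prev > threshold:
            clusters.append([])   # a big gap starts a new cluster
        clusters[-1].append(cur)
    return clusters

a = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
print(split_by_gap_threshold(a, factor=0.5))
# [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```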

But is that really what you want to do?

First try to define what should be a cluster, and what not. The second part is much more important.

Has QUIT--Anony-Mousse

You could use nearest neighbor clustering. For a point to belong to one of the clusters, its nearest neighbor must also belong to the cluster. With the case you've shown, you'd just need to iterate along the x-axis and compare the differences to the adjacent points. When the difference to the previous point is greater than the difference to the next point, it indicates the start of a new cluster.
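That rule might be sketched like this (the function name is mine; the last point has no successor, so its "next gap" is treated as infinite and it simply joins the current cluster):

```python
def nearest_neighbour_split(values):
    """Start a new cluster wherever a point is nearer to its successor
    than to its predecessor."""
    data = sorted(values)
    clusters = [[data[0]]]
    for i in range(1, len(data)):
        prev_gap = data[i] - data[i - 1]
        next_gap = data[i + 1] - data[i] if i + 1 < len(data) else float('inf')
        if prev_gap > next_gap:
            clusters.append([])   # this point belongs with its successor
        clusters[-1].append(data[i])
    return clusters

print(nearest_neighbour_split([1, 2, 3, 60, 70, 80, 100, 220, 230, 250]))
# [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```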

Michael J. Barber