32

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).

I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.

Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
Murat Derya Özen
  • 2,154
  • 8
  • 31
  • 44

7 Answers7

10

I'm not aware of a complete and exact python implementation of OPTICS. The links posted here seem just rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or more likely even O(n^3).

OPTICS has a number of tricky things besides the obvious idea. In particular, the thresholding is proposed to be done with relative thresholds ("xi") instead of absolute thresholds as posted here (at which point the result will be approximately that of DBSCAN!).

The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:

http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf

The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters, it only computes the cluster order. For this it makes a duplicate of the database - it isn't really Weka code.

There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.

jsalter
  • 5,758
  • 2
  • 16
  • 10
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • 1
    Indeed, there are a lots of incomplete OPTICS implementations and clones of the Weka version around. You should take the ELKI version as reference. – Erich Schubert Jan 01 '13 at 11:35
  • The relative thresholding is to my mind the point where a relatively clear exposition and approach transition into something much more cloudy, with additional heuristics and hidden parameters. There is probably no way around this, but I definitely feel that the intermediate ordered reachability values constitute a nice result that stands on its own. What happens afterwards is open to different approaches, and the one chosen in this paper is not so self-evident as to be the only one worth considering. – micans Jan 08 '13 at 15:32
  • There are at least two more methods proposed for how to exact clusters from the plot. Yet, without such a cluster extraction method, is it actually a clustering algorithm? At some point, you do want to get clusters out of it, not just a plot. – Has QUIT--Anony-Mousse Jan 08 '13 at 17:13
  • Note that absolute thresholding means you are doing DBSCAN, not OPTICS. Just more complicated and thus slower. – Has QUIT--Anony-Mousse Jan 08 '13 at 17:17
8

EDIT: the following is known to not be a complete implementation of OPTICS.

I did a quick search and found the following (Optics). I can't vouch for its quality, however the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.

Here is a quick example of how to build clusters on the output of the optics algorithm:

def cluster(order, distance, points, threshold):
    ''' Given the output of the options algorithm,
    compute the clusters:

    @param order The order of the points
    @param distance The relative distances of the points
    @param points The actual points
    @param threshold The threshold value to cluster on
    @returns A list of cluster groups
    '''
    clusters = [[]]
    points   = sorted(zip(order, distance, points))
    splits   = ((v > threshold, p) for i,v,p in points)
    for iscluster, point in splits: 
        if iscluster: clusters[-1].append(point)
        elif len(clusters[-1]) > 0: clusters.append([])
    return clusters

    rd, cd, order = optics(points, 4)
    print cluster(order, rd, points, 38.0)
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
Bashwork
  • 1,619
  • 8
  • 14
  • Thanks Bashwork but that seems the exact same code as what vartec has suggested. The problem with that is, I couldn't figure out how I would extract the clustering structure (which elements belongs to which cluster) from the output of that algorithm. Please have a look at the 'Note' at the very bottom of my question. – Murat Derya Özen Apr 28 '11 at 22:41
  • So the code gives you the output you need to extract clusters (the order and the reachability distance). If you look at the wikipedia section for extracting clusters, you simply need to walk the ordered results with a threshold on the distances (lower threshold means more clusters). (http://en.wikipedia.org/wiki/OPTICS_algorithm). If this doesn't make sense I can give some example code. – Bashwork Apr 29 '11 at 17:50
  • 1
    I just ran the code you have posted and the result I get with a threshold of 38 is [[31.0, 87.0], [73.0, 9.0]] [[5.0, 8.0]] [[97.0, 9.0]] (3 clusters). I lower the threshold to 10 and there is only 1 cluster. The test data I used is the same as the one used in the link you gave (testX). I would appreciate it if you can correct the code and I'll award your bounty. – Murat Derya Özen Apr 29 '11 at 22:39
  • Based on my understanding of the algorithm, those results are correct as a cluster is created every time the ordered collection descends below the given threshold. In the case of 38, there are three valleys while in the case of 10 there is only one (the zero result). The threshold basically controls what should be considered a valley. – Bashwork May 02 '11 at 13:46
  • 12
    I would take all this here with a grain of salt. It diverts considerably from my understanding of this algorithm. You might want to get the official implementation (not in python) and compare the results. – Has QUIT--Anony-Mousse Feb 08 '12 at 17:42
7

While not technically OPTICS there is an HDBSCAN* implementation for python available at https://github.com/lmcinnes/hdbscan . This is equivalent to OPTICS with an infinite maximal epsilon, and a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy you can extract clusters from that via more traditional OPTICS methods as well if you would prefer.

Note that despite not limiting the epsilon parameter this implementation still achieves O(n log(n)) performance using kd-tree and ball-tree based minimal spanning tree algorithms, and can handle quite large datasets.

Leland McInnes
  • 316
  • 2
  • 2
6

There now exists the library pyclustering that contains, amongst others, a Python and a C++ implementation of OPTICS.

Thomas
  • 4,696
  • 5
  • 36
  • 71
4

It is now implemented in the development version (scikit-learn v0.21.dev0) of sklearn (a clustering and maschine learning module for python)

here is the link: https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html

1

See "Density-based clustering approaches" on http://www.chemometria.us.edu.pl/index.php?goto=downloads

vartec
  • 131,205
  • 36
  • 218
  • 244
  • 1
    Thanks for the answer vartec, but the implementation seemed incomplete to me. I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x,y) pairs belonging to that cluster. – Murat Derya Özen Apr 23 '11 at 21:01
1

You want to look at a space-filling-curve or a spatial index. A sfc reduce the 2d complexity to a 1d complexity. You want to look at Nick's hilbert curve quadtree spatial index blog. You want to download my implementation of a sfc at phpclasses.org (hilbert-curve).

Micromega
  • 12,486
  • 7
  • 35
  • 72
  • Thanks epitaph but how exactly does this answer my question? Can you clarify your answer a bit more? – Murat Derya Özen Apr 23 '11 at 21:55
  • A sfc is a clustering algorithm using a fractal. A hilbert curve has a fractal dimension of 2. If you have 2d data you can easily subdivide this data into smaller tiles. Basically it's a reordering. It's like storing them in a quadtree. You can also use an adaptive sfc where emtpy regions are skipped or have a lower granularity of the sfc. Sfc is often used in maps, like google maps. – Micromega Apr 23 '11 at 22:02
  • Sounds good and worth trying. Thank you. But I'm still looking for an OPTICS implementation in Python. – Murat Derya Özen Apr 23 '11 at 23:32
  • No problem, I didn't know about OPTICS, I didn't understand it either. It seems to me it is using a spatial index but I don't understand why. – Micromega Apr 23 '11 at 23:43
  • 2
    With a spatial index, OPTICS can run in `O(n log n)` instead of `O(n^2)`. So for large, indexable data sets, this makes quite a difference. An R*-Tree is probably a better choice than a sfc though. – Has QUIT--Anony-Mousse Feb 09 '12 at 13:23