I want to use Apache Commons Math's DBSCANClusterer&lt;T extends Clusterable&gt; to perform a clustering using the DBSCAN algorithm, but with a custom distance metric, as my data points contain non-numerical values. This seems to have been easily achievable in the older, now deprecated version of the class (note that its fully qualified name is org.apache.commons.math3.stat.clustering.DBSCANClusterer&lt;T&gt; in the older version, whereas it is org.apache.commons.math3.ml.clustering.DBSCANClusterer&lt;T&gt; in the current release). In the older version, Clusterable took a type parameter, T, describing the type of the data points being clustered, and the distance between two points was defined by one's implementation of Clusterable.distanceFrom(T), e.g.:

class MyPoint implements Clusterable<MyPoint> {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double distanceFrom(MyPoint p) {
        // Arbitrary distance metric goes here, e.g.:
        double stringsEqual = this.someStr.equals(p.someStr) ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(p.someDouble - this.someDouble, 2.0)); 
    }
}
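
For reference, running the deprecated clusterer is then just a matter of constructing it with an epsilon and a minimum number of points and passing the points in. A minimal sketch with placeholder eps/minPts values (assuming MyPoint fully implements the old Clusterable&lt;MyPoint&gt; interface):

import java.util.List;

import org.apache.commons.math3.stat.clustering.Cluster;
import org.apache.commons.math3.stat.clustering.DBSCANClusterer;

class OldApiExample {
    static void run(List<MyPoint> points) {
        // eps = 1.0 and minPts = 2 are placeholders, not recommendations.
        DBSCANClusterer<MyPoint> clusterer = new DBSCANClusterer<MyPoint>(1.0, 2);
        List<Cluster<MyPoint>> clusters = clusterer.cluster(points);
        for (Cluster<MyPoint> cluster : clusters) {
            System.out.println("cluster size: " + cluster.getPoints().size());
        }
    }
}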

In the current release, Clusterable is no longer parameterized. This means that one has to come up with a way of representing one's (potentially non-numerical) data points as a double[] and return that representation from getPoint(), e.g.:

class MyPoint implements Clusterable {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double[] getPoint() {
        double[] res = new double[2];
        res[1] = someDouble; // obvious
        res[0] = ...; // some way of representing someStr as a double required
        return res;
    }
}

And then provide an implementation of DistanceMeasure that defines the custom distance function in terms of the double[] representations of the two points being compared, e.g.:

class CustomDistanceMeasure implements DistanceMeasure {
    @Override
    public double compute(double[] a, double[] b) {
        // Let's mimic the distance function from earlier, assuming that
        // a[0] is different from b[0] if the two 'someStr' variables were
        // different when their double representations were created.
        double stringsEqual = a[0] == b[0] ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(a[1] - b[1], 2.0));
    }
}
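
For completeness, this is how I expect the pieces to be wired together in the current API; a minimal sketch with placeholder eps/minPts values, using the MyPoint and CustomDistanceMeasure classes above:

import java.util.List;

import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;

class NewApiExample {
    static void run(List<MyPoint> points) {
        // eps = 1.0 and minPts = 2 are placeholders; the third constructor
        // argument plugs in the custom distance function.
        DBSCANClusterer<MyPoint> clusterer =
                new DBSCANClusterer<MyPoint>(1.0, 2, new CustomDistanceMeasure());
        List<Cluster<MyPoint>> clusters = clusterer.cluster(points);
        for (Cluster<MyPoint> cluster : clusters) {
            System.out.println("cluster size: " + cluster.getPoints().size());
        }
    }
}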

My data points are of the form (integer, integer, string, string):

class MyPoint {
    int i1;
    int i2;
    String str1;
    String str2;
}

And I want to use a distance function/metric that essentially says "if str1 and/or str2 differ for MyPoint mpa and MyPoint mpb, the distance is maximal, otherwise the distance is the Euclidean distance between the integers" as illustrated by the following snippet:

class Dist {
    static double distance(MyPoint mpa, MyPoint mpb) {
        if (!mpa.str1.equals(mpb.str1) || !mpa.str2.equals(mpb.str2)) {
            return Double.MAX_VALUE;
        }
        return Math.sqrt(Math.pow(mpa.i1 - mpb.i1, 2.0) + Math.pow(mpa.i2 - mpb.i2, 2.0));
    }
}
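
For illustration, this is what I believe falling back to the deprecated (stat.clustering) API would look like for this metric. The class name is just for the sketch, and centroidOf is, as far as I can tell, required by the old Clusterable interface even though DBSCAN never computes centroids:

import java.util.Collection;

import org.apache.commons.math3.stat.clustering.Clusterable;

// Hypothetical adaptation of MyPoint to the deprecated API.
class MyClusterablePoint implements Clusterable<MyClusterablePoint> {
    int i1;
    int i2;
    String str1;
    String str2;

    @Override
    public double distanceFrom(MyClusterablePoint p) {
        // Same metric as Dist.distance above.
        if (!str1.equals(p.str1) || !str2.equals(p.str2)) {
            return Double.MAX_VALUE;
        }
        return Math.sqrt(Math.pow(i1 - p.i1, 2.0) + Math.pow(i2 - p.i2, 2.0));
    }

    @Override
    public MyClusterablePoint centroidOf(Collection<MyClusterablePoint> points) {
        // DBSCAN does not use cluster centroids, so this should never be called
        // during clustering; it only exists to satisfy the interface.
        throw new UnsupportedOperationException("No centroid defined for this point type");
    }
}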

Questions:

  1. How do I represent a String as a double in order to enable the above distance metric in the current release (v3.6.1) of Apache Commons Math? String.hashCode() is insufficient, as hash code collisions would cause different strings to be considered equal. In general, this seems like an unsolvable problem, as I'm essentially trying to create a unique mapping from an infinite set of strings to a finite set of numerical values (64-bit doubles). (A partial per-dataset workaround is sketched after this list.)
  2. As (1) seems impossible, am I misunderstanding how to use the library? If yes, where did I take a wrong turn?
  3. Is my only alternative to use the deprecated version for this kind of distance metric? If yes, (3a) why would the designers choose to make the library less general? Perhaps in favor of speed? Perhaps to get rid of the self-reference in class MyPoint implements Clusterable&lt;MyPoint&gt;, which some might consider bad design? (I realize that this might be too opinionated, so please disregard it if that is the case.) For the commons-math experts: (3b) what downsides are there to using the deprecated version other than forward compatibility (the deprecated version will be removed in 4.0)? Is it slower? Perhaps even incorrect?
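
Regarding (1): the only partial workaround I can think of is to exploit the fact that any single dataset only contains finitely many distinct strings, so the mapping does not need to be unique over all possible strings, only over those actually present in a given run. A rough, hypothetical sketch (StringIndexer is my own helper, not part of commons-math; it is exact because every int index is exactly representable as a double):

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assigns each distinct string a unique index so that
// equal strings map to equal doubles and different strings map to different
// doubles. Only meaningful within a single clustering run.
class StringIndexer {
    private final Map<String, Integer> indices = new HashMap<String, Integer>();

    double indexOf(String s) {
        Integer idx = indices.get(s);
        if (idx == null) {
            idx = indices.size();
            indices.put(s, idx);
        }
        return idx;
    }
}

getPoint() would then encode str1 and str2 via a shared StringIndexer instance, and the DistanceMeasure would compare those coordinates with ==. It still feels like a workaround rather than the intended usage, hence question (2) above.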

Note: I am aware of ELKI, which is apparently popular among a number of SO users, but it does not fit my needs, as it is marketed as a command-line and GUI tool rather than as a Java library to be included in third-party applications. As its documentation puts it:

You can even embed ELKI into your application (if you accept the AGPL-3 license), but we currently do not (yet) recommend to do so, because the API is still changing substantially. [...]

ELKI is not designed as embeddable library. It can be used, but it is not designed to be used this way. ELKI has tons of options and functionality, and this comes at a price, both in runtime (although it can easily outperform R and Weka, for example!) memory usage and in particular in code complexity.

ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works good, then reimplement that approach in an optimized manner for your problem (maybe even in C++ then, to further reduce memory and runtime).

  • I am missing what the Strings look like but perhaps you can get the String as an array of bytes and encode them using characters from 0 to 9 (Similar as Base64 encodes bytes). And turn the resulting number into a double. Perhaps assume the number will start with 1. not to lose leading zeroes. – Juan Sep 13 '18 at 03:40
  • @Juan the strings are a mix of IPs in decimal form and hostnames. For IPv4, I could obviously just interpret the 32 bits as a number and use that directly. Hostnames, however, are more tricky. Another issue is that, say, `MyPoint.str1` can be a hostname in one data point but an IP in the next data point. – Janus Varmarken Sep 13 '18 at 06:11
  • Using ELKI as a library works fine, I do this sometimes. But the API may break when updating to a new version (but the same happened for Apache to you, a breaking API change). And some classes like the R-tree have many parameters to set, unfortunately. – Has QUIT--Anony-Mousse Sep 14 '18 at 06:02
  • @Anony-Mousse I see. I just didn't get too excited about ELKI as an API as I couldn't really figure out how to use it from browsing the website. The Apache one seems rather straight forward; I might have to go with the deprecated version though. – Janus Varmarken Sep 14 '18 at 06:21
  • The unit tests are a good source of example. – Has QUIT--Anony-Mousse Sep 14 '18 at 18:40
  • In ELKI, you can implement [PrimitiveDistanceFunction](https://elki-project.github.io/releases/release0.7.1/doc/de/lmu/ifi/dbs/elki/distance/distancefunction/PrimitiveDistanceFunction.html) for your own data type (not just double arrays - **ELKI is designed to be extensible** this way!), and use it with [DBSCAN](https://elki-project.github.io/releases/release0.7.1/doc/de/lmu/ifi/dbs/elki/algorithm/clustering/DBSCAN.html). yes, the ELKI API will break for 0.8 (all classes will be moved to a different package), but that happened for Apache, too... so Apache is not more stable than ELKI. – Erich Schubert Sep 16 '18 at 00:24
