
Should the input to sklearn.cluster.DBSCAN be pre-processed?

In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:

import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

D = distance.squareform(distance.pdist(X))  # pairwise distance matrix
S = 1 - (D / np.max(D))                     # normalized into a similarity matrix
db = DBSCAN(eps=0.95, min_samples=10).fit(S)

In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
db = DBSCAN(eps=0.3, min_samples=10).fit(X)

I base my code on the latter example and have the impression that clustering works better with this scaling. However, this scaling "standardizes features by removing the mean and scaling to unit variance". I am trying to find 2D clusters. If my clusters are distributed in a square area, say 100x100, I see no problem with the scaling. However, if they are distributed in a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. Doesn't this deteriorate the clustering? Or am I misunderstanding something? Do I need to apply any preprocessing at all, or can I simply feed in my 'raw' data?
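A minimal sketch of the effect I mean (synthetic data, all numbers purely illustrative): standardizing points spread over an 800x200 rectangle shrinks one axis about four times more strongly than the other.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.uniform([0, 0], [800, 200], size=(1000, 2))  # 800x200 rectangle

print(X.std(axis=0))          # roughly [231, 58]: per-axis spread differs ~4x
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.std(axis=0))   # [1. 1.]: the x axis was squeezed ~4x harder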

asked by Alex (edited by Has QUIT--Anony-Mousse)

1 Answer


It depends on what you are trying to do.

If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
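For example, a minimal sketch (coordinates and parameter values are made up for illustration), assuming the points are already in a projected coordinate system with meters on both axes:

import numpy as np
from sklearn.cluster import DBSCAN

coords_m = np.array([[0.0, 0.0], [50.0, 30.0], [40.0, 10.0],
                     [5000.0, 5000.0], [5060.0, 4980.0]])

# eps is in the same unit as the data: here, 100 meters
db = DBSCAN(eps=100, min_samples=2).fit(coords_m)
print(db.labels_)  # [0 0 0 1 1]: two groups roughly 7 km apart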

And yes, a non-uniform scaling in particular does distort distances, while a non-distorting (uniform) scaling is equivalent to just using a different epsilon value!
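A quick sketch of that equivalence (synthetic data): multiplying all coordinates by a constant c and eps by the same c produces identical labels, because Euclidean distances scale linearly with c.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
X = rng.rand(200, 2)
c = 10.0  # uniform scaling factor

labels_a = DBSCAN(eps=0.1, min_samples=5).fit(X).labels_
labels_b = DBSCAN(eps=0.1 * c, min_samples=5).fit(c * X).labels_
print(np.array_equal(labels_a, labels_b))  # True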

Note that in the first example, apparently a similarity and not a distance matrix is processed. S = (1 - D / np.max(D)) is a heuristic to convert a distance matrix into a similarity matrix. Epsilon 0.95 then effectively means "at most 0.05 of the maximum dissimilarity observed". An alternative version that should yield the same result is:

# assuming the same imports as in the first example
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D                             # unnormalized similarity matrix
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)

In the second example, by contrast, fit(X) actually processes the raw input data, not a distance matrix. IMHO it is an ugly hack to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
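In current scikit-learn versions the two cases are at least separated explicitly by the metric parameter; a minimal sketch of both usages on the same synthetic data:

import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.rand(100, 2)

# raw data: DBSCAN computes Euclidean distances internally
labels_raw = DBSCAN(eps=0.2, min_samples=5).fit(X).labels_

# precomputed distance matrix: must be declared as such
D = distance.squareform(distance.pdist(X))
labels_pre = DBSCAN(eps=0.2, min_samples=5,
                    metric='precomputed').fit(D).labels_
print(np.array_equal(labels_raw, labels_pre))  # True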

Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually, you don't "fit" a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and it needs O(n^2) memory (which DBSCAN usually would not).

In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance is (e.g. geographic data; doing a standardization on this obviously does not make sense, and neither does Euclidean distance!).
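As an illustration of matching the distance to the data: current scikit-learn versions accept a haversine metric for geographic coordinates, so neither standardization nor Euclidean distance is needed (the coordinates below are made up):

import numpy as np
from sklearn.cluster import DBSCAN

latlon_deg = np.array([[52.52, 13.40], [52.53, 13.41],   # near Berlin
                       [48.85, 2.35], [48.86, 2.34]])    # near Paris

# haversine works on (lat, lon) in radians; eps is an angle,
# so convert a 5 km ground distance via the Earth's radius (~6371 km)
latlon_rad = np.radians(latlon_deg)
db = DBSCAN(eps=5.0 / 6371.0, min_samples=2, metric='haversine',
            algorithm='ball_tree').fit(latlon_rad)
print(db.labels_)  # [0 0 1 1]: one cluster per city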

answered by Has QUIT--Anony-Mousse
  • Thank you very much for your fast reply. I'd like to identify blinking light sources that might move around randomly, which leads to a Gaussian smearing. In addition, I have noise overlaid. Currently I'm ignoring the blink intensities and just feed in the 2D positions of blink events. So I think Euclidean distance is OK? From your answer I understand that in my case I don't have to pre-process the data (which is positions in nm). But what about the sklearn implementation? Does it actually need similarities as input, or can I just give it the positions and it applies the Euclidean distance measure itself? – Alex Jul 04 '13 at 08:04
  • If you have pixels that are equally spaced on x and y, then don't normalize and use Euclidean. As for sklearn, you'll have to dig through the documentation and source code. I believe if you feed it raw data, it will compute a Euclidean distance matrix on its own. (But NOT use indexes for acceleration. Try ELKI, it should be a lot faster with indexes). – Has QUIT--Anony-Mousse Jul 04 '13 at 08:17
  • Ok, thanks a lot. I'll have a look at ELKI and dig through the sklearn docs. – Alex Jul 05 '13 at 07:11