
I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters in a fire data set, and the results look quite good. The data set is spatial and the clusters are based on latitude and longitude. Basically, the DBSCAN parameters define the density threshold, so the resulting clusters are the hot spot regions with a high concentration of fire points.

My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?

Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?

Has QUIT--Anony-Mousse

2 Answers


ELKI contains a number of evaluation functions for clusterings.

To enable them, use the -evaluator parameter with an evaluator from the evaluation.clustering.internal package.

Some of them will not automatically run because they have quadratic runtime cost - probably more than your clustering algorithm.

I do not trust these measures. They are designed for particular clustering algorithms and are mostly useful for choosing the k parameter of k-means, not much more than that. If you blindly go by these measures, you end up with useless results most of the time. Also, these measures do not work with noise, with either of the strategies we tried.

The cheapest are the label-based evaluators. These will automatically run, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index to compare the similarity of two clusterings. All of these indexes are sensitive to noise so they don't work too well with DBSCAN, unless your reference has the same concept of noise as DBSCAN.
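To illustrate the noise problem, here is a minimal pure-Python sketch of the Adjusted Rand Index (this is not ELKI's implementation). The noise label `-1` and the helper `drop_noise` are assumptions made for the example; adapt them to however your tool marks noise.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: nothing to disagree on
    # Pairs placed together in both labelings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one single cluster
    return (together_both - expected) / (maximum - expected)


def drop_noise(reference, clustering, noise_label=-1):
    """Hypothetical helper: keep only points the clustering did not mark as noise."""
    kept = [(r, c) for r, c in zip(reference, clustering) if c != noise_label]
    return [r for r, _ in kept], [c for _, c in kept]


reference = [0, 0, 0, 1, 1, 1]        # reference labels without a noise concept
dbscan_like = [0, 0, -1, 1, 1, -1]    # same structure, two points marked as noise

ari_with_noise = adjusted_rand_index(reference, dbscan_like)
ari_without_noise = adjusted_rand_index(*drop_noise(reference, dbscan_like))
print(ari_with_noise, ari_without_noise)
```

Treating noise as one extra cluster pulls the score well below 1 even though the clustered points agree perfectly, while dropping the noise points hides how many points were discarded. Neither strategy is neutral, which is why the reference needs the same concept of noise.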

If you can afford it, a "subjective" evaluation is always best.

You want to solve a problem, not a number. That is the whole point of "data science", being problem oriented and solving the problem, not obsessed with minimizing some random quality number. If the results don't work in reality, you failed.

Erich Schubert

There are different methods to validate a DBSCAN clustering output. Generally we can distinguish between internal and external indices, depending on whether you have labeled data available. For DBSCAN there is a great internal validation index called DBCV.

External Indices: If you have some labeled data, external indices are great and can show how well the clustering agrees with the labels. One example is the Rand index: https://en.wikipedia.org/wiki/Rand_index

Internal Indices: If you don't have labeled data, then internal indices can be used to give the clustering result a score. In general the indices calculate the distance of points within the cluster and to other clusters and try to give you a score based on the compactness (how close are the points to each other in a cluster?) and separability (how much distance is between the clusters?).
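As a rough illustration of compactness versus separation, here is a minimal pure-Python sketch of the mean silhouette score. This is a generic internal index, not DBCV, and it is not designed for density-based clusters. The noise label `-1`, the choice to skip noise points, and plain Euclidean distance on the coordinates are all assumptions of this example; for real latitude/longitude data a geodesic distance would be more appropriate.

```python
from math import dist


def mean_silhouette(points, labels, noise_label=-1):
    """Mean silhouette over all non-noise points (quadratic in the number of points)."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != noise_label:          # assumption: noise is simply ignored
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two non-noise clusters")

    scores = []
    for label, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)        # usual convention for singleton clusters
                continue
            # Compactness: mean distance to the other members of the same cluster.
            a = sum(dist(p, q) for q in members) / (len(members) - 1)
            # Separation: mean distance to the nearest other cluster.
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight, well-separated groups plus one point marked as noise.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 50)]
labels = [0, 0, 0, 1, 1, 1, -1]
score = mean_silhouette(points, labels)
print(score)
```

Values near 1 indicate compact, well-separated clusters. Elongated or irregularly shaped clusters, which DBSCAN finds easily, can score poorly here even when they are meaningful.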

For DBSCAN, there is one great internal validation index called DBCV, by Moulavi et al. The paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96 Python package: https://github.com/christopherjenness/DBCV

Julian_Kor