
I am looking at data points that have lat, lng, and date/time of an event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works OK at clustering lat and lng, my concern is that it will fall apart when incorporating temporal information, since time is not on the same scale or the same type of distance.

What are my options for incorporating temporal data into the DBSCAN algorithm?

Has QUIT--Anony-Mousse
cbake

2 Answers


Look up Generalized DBSCAN by the same authors.

Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.

For (Generalized) DBSCAN, you need two functions:

  1. findNeighbors - get all "related" objects from your database

  2. corePoint - decide whether this set is enough to start a cluster

Then you can repeatedly find neighbors to grow the clusters.

Function 1 is where you want to hook in, for example by using two thresholds: one that is geographic and one that is temporal (e.g. within 100 miles, and within 1 hour).
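A minimal sketch of that idea in Python (my own illustration, not the paper's reference implementation): a plain DBSCAN loop where the neighborhood predicate (function 1) applies two separate thresholds, a geographic one via the haversine formula and a temporal one in hours. The function names and threshold values are assumptions for illustration.

```python
# Illustrative sketch of GDBSCAN-style clustering with a two-threshold
# neighborhood predicate. Not a reference implementation.
import math
from datetime import datetime, timedelta

def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * math.asin(math.sqrt(a))  # Earth radius ~3959 miles

def find_neighbors(points, i, eps_miles=100, eps_hours=1):
    """Function 1: indices of all points within BOTH thresholds of points[i]."""
    lat, lng, t = points[i]
    return [j for j, (la, ln, tj) in enumerate(points)
            if haversine_miles((lat, lng), (la, ln)) <= eps_miles
            and abs((tj - t).total_seconds()) <= eps_hours * 3600]

def gdbscan(points, min_pts=3, eps_miles=100, eps_hours=1):
    """Standard DBSCAN loop; only the neighbor predicate is customized.
    points: list of (lat, lng, datetime) tuples. Returns cluster labels,
    with -1 meaning noise."""
    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = find_neighbors(points, i, eps_miles, eps_hours)
        if len(neighbors) < min_pts:   # Function 2: core-point test
            labels[i] = -1
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:        # noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = find_neighbors(points, j, eps_miles, eps_hours)
            if len(nbrs) >= min_pts:   # expand only from core points
                seeds.extend(nbrs)
        cluster += 1
    return labels
```

With this structure, changing how location or time is handled only means editing `find_neighbors`; the clustering loop itself never needs to know what "distance" means.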

Has QUIT--Anony-Mousse
  • ST-DBSCAN is another algorithm that seems able to handle temporal data. It looks like it works in a similar manner (setting two thresholds). – cbake Jun 04 '15 at 16:41
  • My question is also similar to the one above; my data set includes GPS coordinates and every lat/long value is time-stamped. There are five neighborhood functions in ELKI (when I select GDBScan): EpsilonNeighborPredicate, COPACNeighborPredicate, ERiCNeighborPredicate, FourCNeighborPredicate, PreDeConNeighborPredicate. But I am not sure which one to use. Any suggestions? – user1124825 Dec 13 '15 at 17:47
  • Define your own, that accommodates location as *you* want to accommodate location, and that accommodates time as *you* want to accommodate time. – Has QUIT--Anony-Mousse Dec 13 '15 at 18:45
  • I'm wondering if a scaling (or normalization) of the data, including converting the times maybe to a "seconds since X" vector shouldn't be able to do this properly? – K.-Michael Aye Aug 17 '16 at 13:39
  • It makes much more sense *and* is usable *and* is straightforward and easy to define two thresholds: at most 10 miles away *and* at most 1 day apart. Instead of mashing things into Euclidean space via scaling, where you end up with a logic like `a * distance^2 + b * timedelta^2 < 100^2`. – Has QUIT--Anony-Mousse Aug 18 '16 at 11:48

tl;dr you are going to have to modify your feature set, i.e. scale your date/time to match the magnitude of your geo data.

DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitude bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scaling will screw it up.

Now I suppose you can modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function, instead of using the default Euclidean distance.

IMHO, though, the easier thing to do is to scale one of your dimensions to match the other. Just multiply your time values by a fixed, linear factor so they are on the same order of magnitude as the geo values, and you should be good to go.
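As a sketch of that scaling step (the `degrees_per_hour` factor below is an arbitrary assumption for illustration, not a recommendation), you could map each event to a 3-D vector before handing it to a stock Euclidean DBSCAN:

```python
# Hypothetical feature-scaling helper: make one hour of time separation
# "count" the same as `degrees_per_hour` degrees of geographic separation.
# The right factor depends entirely on your data and goals.
from datetime import datetime, timedelta

def to_feature_vector(lat, lng, when, origin, degrees_per_hour=0.1):
    """Convert (lat, lng, datetime) into a plain numeric 3-D point.
    `origin` is any fixed reference time ("seconds since X")."""
    hours = (when - origin).total_seconds() / 3600.0
    return (lat, lng, hours * degrees_per_hour)

origin = datetime(2015, 1, 1)
v = to_feature_vector(40.0, -75.0, origin + timedelta(hours=10), origin)
# v == (40.0, -75.0, 1.0): 10 hours scaled by 0.1 degrees/hour
```

Note the caveat from the comments above: this forces time and space into one Euclidean metric, whereas two independent thresholds keep the semantics of each dimension explicit.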

More generally, this is part of the feature selection process, which is arguably the most important part of solving any machine learning problem. Choose the right features, and transform them correctly, and you'll be more than halfway to a solution.

oxymor0n