
I am looking at data points that have lat, lng, and date/time of an event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works OK at clustering lat and lng, my concern is that it will fall apart when incorporating temporal information, since time is not on the same scale or the same type of distance.

What are my options for incorporating temporal data into the DBSCAN algorithm?

Has QUIT--Anony-Mousse
cbake

2 Answers


Look up Generalized DBSCAN by the same authors.

Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.

For (Generalized) DBSCAN, you need two functions:

  1. findNeighbors - get all "related" objects from your database

  2. corePoint - decide whether this set is enough to start a cluster

Then you can repeatedly find neighbors to grow the clusters.

Function 1 is where you want to hook in, for example by using two thresholds: one that is geographic and one that is temporal (e.g. within 100 miles, and within 1 hour).
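A minimal sketch of that idea in Python (my own illustration, not the paper's reference implementation): a plain DBSCAN loop where the neighborhood predicate (function 1) applies two separate thresholds, a geographic one via the haversine formula and a temporal one in hours. The function names and threshold values are assumptions for illustration.

```python
# Illustrative sketch of GDBSCAN-style clustering with a two-threshold
# neighborhood predicate. Not a reference implementation.
import math
from datetime import datetime, timedelta

def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * math.asin(math.sqrt(a))  # Earth radius ~3959 miles

def find_neighbors(points, i, eps_miles=100, eps_hours=1):
    """Function 1: indices of all points within BOTH thresholds of points[i]."""
    lat, lng, t = points[i]
    return [j for j, (la, ln, tj) in enumerate(points)
            if haversine_miles((lat, lng), (la, ln)) <= eps_miles
            and abs((tj - t).total_seconds()) <= eps_hours * 3600]

def gdbscan(points, min_pts=3, eps_miles=100, eps_hours=1):
    """Standard DBSCAN loop; only the neighbor predicate is customized.
    points: list of (lat, lng, datetime) tuples. Returns cluster labels,
    with -1 meaning noise."""
    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = find_neighbors(points, i, eps_miles, eps_hours)
        if len(neighbors) < min_pts:   # Function 2: core-point test
            labels[i] = -1
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:        # noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = find_neighbors(points, j, eps_miles, eps_hours)
            if len(nbrs) >= min_pts:   # expand only from core points
                seeds.extend(nbrs)
        cluster += 1
    return labels
```

With this structure, changing how location or time is handled only means editing `find_neighbors`; the clustering loop itself never needs to know what "distance" means.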

Has QUIT--Anony-Mousse
  • ST-DBSCAN is another algorithm that seems able to handle temporal data. It looks like it works in a similar manner (setting two thresholds). – cbake Jun 04 '15 at 16:41
  • My question is also similar to the one above; my data set includes GPS coordinates and every lat/long value is time-stamped. There are five neighborhood functions in ELKI (when I select GDBScan): EpsilonNeighborPredicate, COPACNeighborPredicate, ERiCNeighborPredicate, FourCNeighborPredicate, PreDeConNeighborPredicate. But I am not sure which one to use. Any suggestions? – user1124825 Dec 13 '15 at 17:47
  • Define your own, that accommodates location as *you* want to accommodate location, and that accommodates time as *you* want to accommodate time. – Has QUIT--Anony-Mousse Dec 13 '15 at 18:45
  • I'm wondering if a scaling (or normalization) of the data, including converting the times maybe to a "seconds since X" vector shouldn't be able to do this properly? – K.-Michael Aye Aug 17 '16 at 13:39
  • It makes much more sense *and* is usable *and* is straightforward and easy to define two thresholds: at most 10 miles away *and* at most 1 day apart. Instead of mashing things into Euclidean space via scaling, where you end up with a logic like `a * distance^2 + b * timedelta^2 < 100^2`. – Has QUIT--Anony-Mousse Aug 18 '16 at 11:48

tl;dr you are going to have to modify your feature set, i.e. scale your date/time to match the magnitude of your geo data.

DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitude bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scaling will screw it up.

Now I suppose you can modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function, instead of using the default Euclidean distance.

IMHO, though, the easier thing to do is to scale one of your dimensions to match the other. Just multiply your time values by a fixed, linear factor so they are on the same order of magnitude as the geo values, and you should be good to go.
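As a sketch of that scaling step (the `degrees_per_hour` factor below is an arbitrary assumption for illustration, not a recommendation), you could map each event to a 3-D vector before handing it to a stock Euclidean DBSCAN:

```python
# Hypothetical feature-scaling helper: make one hour of time separation
# "count" the same as `degrees_per_hour` degrees of geographic separation.
# The right factor depends entirely on your data and goals.
from datetime import datetime, timedelta

def to_feature_vector(lat, lng, when, origin, degrees_per_hour=0.1):
    """Convert (lat, lng, datetime) into a plain numeric 3-D point.
    `origin` is any fixed reference time ("seconds since X")."""
    hours = (when - origin).total_seconds() / 3600.0
    return (lat, lng, hours * degrees_per_hour)

origin = datetime(2015, 1, 1)
v = to_feature_vector(40.0, -75.0, origin + timedelta(hours=10), origin)
# v == (40.0, -75.0, 1.0): 10 hours scaled by 0.1 degrees/hour
```

Note the caveat from the comments above: this forces time and space into one Euclidean metric, whereas two independent thresholds keep the semantics of each dimension explicit.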

More generally, this is part of the feature selection process, which is arguably the most important part of solving any machine learning problem. Choose the right features, and transform them correctly, and you'll be more than halfway to a solution.

oxymor0n