
I am not very knowledgeable about time-based clustering and am wondering whether any algorithms are well suited to my use case.

I have a series of exertion measurements (values range from 0 to 500) and I want to cluster them into time intervals.

My problem is that I want to find the points in time where the exertion level changes substantially. I will know exactly how many groups there should be (e.g. 5 separate clusters), but I won't know where one ends and the next one starts.

Is there a good algorithm to apply in this case? I was looking at K-Means, but it appears to cluster on the values alone and disregard time, whereas I am looking for boundaries in time based on the exertion data.

Kromster
mornindew

1 Answer


I think you could get good results from a dynamic program. For each interval [i, j), let C(i, j) be a loss function that is lower when the interval values are more likely to be one cluster. Then letting L(k, r) be the minimum loss for up to k clusters of elements [0, r), we have equations

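L(1, r) = C(0, r)
L(k, r) = min over s in [0, r) of [L(k-1, s) + C(s, r)]    for k > 1,

where the loss of an empty prefix is taken to be 0, i.e. L(k, 0) = 0. The s = 0 term then covers the case where fewer than k clusters are used.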

If only O(1) values of k are needed, evaluating these equations with memoization takes O(n^2) time (not counting the cost of evaluating C) and O(n) space, where n is the number of samples.

A plausible first choice for C(i, j) would be the statistical variance of the samples in that interval. Naively, computing the variance from scratch for every interval takes Theta(n^3) time in total, but Welford's algorithm can be used to compute the variance online if, for each r, you iterate s from its greatest value to its least, so the overall algorithm would still be O(n^2).
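
The recurrence and the Welford update described above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `segment`, the `loss`/`cut` tables, and the choice of population variance (dividing by the count) are mine, not part of the answer. It also keeps a full table per k so that the boundaries can be recovered, which costs O(kn) space.

```python
def segment(values, k):
    """Split `values` into at most k contiguous segments, minimizing the
    sum of per-segment variances. Returns (total_loss, segment_starts).
    Hypothetical helper; names and variance convention are assumptions."""
    n = len(values)
    if n == 0 or k < 1:
        raise ValueError("need at least one sample and k >= 1")
    INF = float("inf")
    # loss[c][r]: minimum loss for values[0:r] using at most c segments.
    loss = [[INF] * (n + 1) for _ in range(k + 1)]
    # cut[c][r]: start index s of the last segment [s, r) in that optimum.
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    for c in range(k + 1):
        loss[c][0] = 0.0  # an empty prefix costs nothing

    for r in range(1, n + 1):
        # Welford's online update while the interval [s, r) grows leftwards.
        count, mean, m2 = 0, 0.0, 0.0
        for s in range(r - 1, -1, -1):
            count += 1
            delta = values[s] - mean
            mean += delta / count
            m2 += delta * (values[s] - mean)
            var = m2 / count  # population variance of values[s:r]
            for c in range(1, k + 1):
                cand = loss[c - 1][s] + var
                if cand < loss[c][r]:
                    loss[c][r] = cand
                    cut[c][r] = s

    # Walk the cut table backwards to recover where each segment starts.
    starts = []
    r, c = n, k
    while r > 0:
        s = cut[c][r]
        starts.append(s)
        r, c = s, c - 1
    starts.reverse()
    return loss[k][n], starts
```

For example, `segment([10, 10, 10, 200, 200, 200, 50, 50, 50], 3)` should report segment starts `[0, 3, 6]` with a total loss of 0. One caveat: a single-sample interval has variance 0, so on noisy data a plain sum of variances may favor very short segments; using the sum of squared deviations (`m2` above) instead of `var` is one possible alternative to try.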

David Eisenstat