How to do KMeans clustering with timeseries as a feature

Question

Let's say I have the following dataframe, with continuous data at fixed intervals (so I am not sure the tslearn KMeans clustering package is useful for this):

date                                value
2022-09-06 01:40:50.999059          0.2732
2022-09-05 19:55:02.242936          0.9771
.
.
.

I am trying to use the K means algorithm to cluster this but cannot use

df.date = pd.to_datetime(df.date)
data = df[["date","value"]]
model = KMeans(init="random",n_clusters=k,n_init=10,max_iter=300,random_state=4)
model.fit(data)

because I think the KMeans algorithm requires a float. How would I be able to use date as a feature in the KMeans algorithm?

Error:

TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[datetime64]'>, <class 'numpy.dtype[float64]'>)

You think kmeans requires a float. Have you verified that assumption in any way? What alternatives are you looking for? For that matter, what is your actual question? — Mad Physicist, Nov 15 '22 at 19:49
You have 2D data with columns `["date","value"]`. This is essentially your ordinary time-series plot, with `date` on the horizontal axis and `value` on the vertical one. You could simply replace dates with integers `[1,2,3,4,...]` or [convert dates to timestamps](https://stackoverflow.com/questions/40881876/python-pandas-convert-datetime-to-timestamp-effectively-through-dt-accessor). — ForceBru, Nov 15 '22 at 19:52
Added the error and I am looking for how to fit the model with date data — Dylan, Nov 15 '22 at 19:52

score 1 · Answer 1 · answered Nov 15 '22 at 20:09

One solution is to convert your datetimes to Unix timestamps, i.e. the number of seconds elapsed since January 1st, 1970 UTC (https://en.wikipedia.org/wiki/Unix_time). This way the date column becomes a column of integers, which KMeans can handle.

You can do it like this:

import numpy as np
df["stamp"] = df["date"].values.astype(np.int64) // 10 ** 9

For a datetime64[ns] column, the output of .astype(np.int64) is in nanoseconds, so integer-dividing by 10 ** 9 converts it to seconds.
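Putting the conversion together with the model from the question, a minimal end-to-end sketch could look like the one below. The sample rows, the `cluster` column name and `n_clusters=2` are made up for illustration. Two things here are assumptions on my part rather than part of the answer above: the cast goes through `datetime64[s]` so that it does not depend on the column being stored in nanoseconds, and both features are standardized with `StandardScaler`, because raw timestamps (around 1.6e9) would otherwise dominate the Euclidean distance against values between 0 and 1.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows in the same shape as the question's dataframe.
df = pd.DataFrame(
    {
        "date": [
            "2022-09-05 19:55:02.242936",
            "2022-09-05 20:10:00.000000",
            "2022-09-06 01:40:50.999059",
            "2022-09-06 01:55:00.000000",
        ],
        "value": [0.9771, 0.9512, 0.2732, 0.3011],
    }
)
df["date"] = pd.to_datetime(df["date"])

# Cast to second resolution first, then to int64: whole seconds since the
# Unix epoch, regardless of the unit the datetimes are stored in.
df["stamp"] = df["date"].values.astype("datetime64[s]").astype(np.int64)

# Assumption: put both features on a comparable scale before clustering.
features = StandardScaler().fit_transform(df[["stamp", "value"]])

# Same KMeans settings as in the question, with k fixed to 2 for this toy data.
model = KMeans(init="random", n_clusters=2, n_init=10, max_iter=300, random_state=4)
df["cluster"] = model.fit_predict(features)

print(df[["date", "stamp", "value", "cluster"]])
```

Whether to scale, and how much weight the time axis should get relative to `value`, depends on what you want the clusters to mean, so treat the scaling step as a starting point rather than a requirement.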
