Time Series Clustering in Python

Question

I have a dataframe in which the first column identifies the item and the remaining columns are timestamps. Each timestamp column holds the number of sales of each item at that time. The example below is only a small sample; the real dataframe has hundreds of rows and hundreds of timestamp columns.

import numpy as np
import pandas as pd

d = {'item': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     '2019-07-25 17:00:00': [0, 2, 3, 5, 6, 7, 0, 1, 9, 10],
     '2019-07-26 8:00:00': [0, 2, 3, 0, 3, 5, 0, 1, 9, 10],
     '2019-07-26 16:00:00': [0, 1, 3, 5, 6, 7, 0, 2, 9, 1],
     '2019-07-27 21:00:00': [0, 2, 3, 5, 3, 7, 0, 1, 4, 10]}

df = pd.DataFrame(d)

df

After this, I created train and test datasets and applied the KShape algorithm:

from tslearn.utils import to_time_series_dataset
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from sklearn.model_selection import train_test_split
from tslearn.clustering import KShape
from sklearn.metrics import adjusted_rand_score

# note: both slices select the same first three rows of df
data_train = df.iloc[:3,:]
data_test = df.iloc[:3,:]

data_joined = np.concatenate((data_train, data_test), axis = 0)

# split the joined data into train and test sets
data_train, data_test = train_test_split(data_joined, test_size = 0.2, random_state = 888)

# transform to timeseries
X_train = to_time_series_dataset(data_train[:, 1:])
X_test = to_time_series_dataset(data_test[:, 1:])


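# y_train and y_test: the item ids from the first column
# (np.int was removed in NumPy 1.24; the built-in int is equivalent here)
y_train = data_train[:, 0].astype(int)
y_test = data_test[:, 0].astype(int)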


# scale X_train and X_test
X_train = TimeSeriesScalerMeanVariance(mu=0, std = 1).fit_transform(X_train)
X_test = TimeSeriesScalerMeanVariance(mu=0, std = 1).fit_transform(X_test)

# configure the KShape model
ks = KShape(n_clusters = 3, max_iter = 100, n_init = 100, verbose = 0, random_state = 888)

# fit the model and predict cluster labels for the training set
ks.fit(X_train)
preds = ks.predict(X_train)

# compare the predicted clusters with the item ids
adjusted_rand_score(y_train, preds)

But the adjusted Rand score was 0. What am I doing wrong?

I can't reproduce your code. What versions of tslearn and numba are you using? — Itamar Mushkin, Mar 03 '20 at 13:00
from tslearn.utils import to_time_series_dataset from tslearn.preprocessing import TimeSeriesScalerMeanVariance from sklearn.model_selection import train_test_split from tslearn.clustering import KShape from sklearn.metrics import adjusted_rand_score. Can you reproduce it with these packages plus pandas and numpy? Thanks. — dante, Mar 03 '20 at 13:18
As a rule, if you want to add code to your question, do it by editing, not by commenting. Also, I used exactly those imports: I'm the one who suggested the edit to add them to your question. My problem was with the numba version on my end, which is why I asked for the versions. — Itamar Mushkin, Mar 03 '20 at 14:38
But I can run the analysis without numba. Is that the reason for my output? — dante, Mar 03 '20 at 14:46
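
One possible contributor to the score of 0, independent of numba, is the choice of reference labels: y_train holds item ids rather than known cluster memberships. The sketch below is not the code from the question; it uses only scikit-learn, and both arrays are hypothetical stand-ins (item_ids for a label column where every id is distinct, which is an assumption about the full dataset, and cluster_labels for the output of ks.predict). It illustrates that the adjusted Rand index comes out as 0 whenever no two samples share a reference label but some samples share a cluster.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical reference labels: ten series, each with its own distinct item id.
item_ids = np.arange(1, 11)

# Hypothetical cluster assignments, standing in for ks.predict(X_train).
cluster_labels = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])

# No pair of samples shares an item id, so no pair can agree on
# "same group" in both labelings, and the index collapses to 0.
score_vs_ids = adjusted_rand_score(item_ids, cluster_labels)

# Sanity check: a labeling compared with itself scores 1.
score_vs_self = adjusted_rand_score(cluster_labels, cluster_labels)

print(score_vs_ids, score_vs_self)
```

If this is what happens in the full dataset, the adjusted Rand score only becomes informative once y_train contains a real grouping of the items (for example a known product category), since item ids carry no grouping information.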
