
I have a list of time series data containing 1977 customers. Each customer has 17,544 data points (hourly data for 2 years). I am trying to determine the number of clusters and group similar customers into the same cluster. The code below is my program, where the variable list_of_lists is a list holding each individual customer's time series values as a separate list.

from dtw import dtw
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Define a custom distance function
def my_dist(x, y):
    return np.abs(x - y)

# Compute the pairwise DTW distances
distances = np.zeros((len(list_of_lists), len(list_of_lists)))
for i in range(len(list_of_lists)):
    for j in range(i + 1, len(list_of_lists)):
        x = list_of_lists[i]
        y = list_of_lists[j]
        distance, *rest = dtw(x, y, dist=my_dist)
        distances[i, j] = distance
        distances[j, i] = distance

# Perform hierarchical clustering on the precomputed distance matrix
clustering = AgglomerativeClustering(n_clusters=cluster_count,
                                     metric='precomputed',
                                     linkage='average')
labels = clustering.fit_predict(distances)
print(labels)

However, this program consumes a lot of computation time.

Is there any way to rewrite the program to minimize the computation time for this task?

2 Answers


You should use a warping window of about 5% (see https://www.cs.unm.edu/~mueen/DTW.pdf). That will make it about 20 times faster (and generally more accurate).

Let's assume your data is oversampled. If you downsample it 1 in 2, DTW becomes 4 times faster. If you downsample it 1 in 3, DTW becomes 9 times faster. ... If you downsample it 1 in 5, DTW becomes 25 times faster.

You can use both of those ideas together, and you are 100 times faster. There are other tricks if that is not enough: https://www.cs.unm.edu/~mueen/DTW.pdf
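A minimal sketch of combining both ideas, assuming the series are stored in list_of_lists as in the question; tslearn is used here only because its dtw() exposes a Sakoe-Chiba band directly, and the factor and window size are illustrative values you would tune:

import numpy as np
from tslearn.metrics import dtw as constrained_dtw

factor = 5            # keep every 5th hourly sample (1-in-5 downsampling)
window_frac = 0.05    # Sakoe-Chiba band of roughly 5% of the series length

# Downsample every customer's series before computing any distances
downsampled = [np.asarray(ts, dtype=float)[::factor] for ts in list_of_lists]

# Express the warping window as an absolute radius in (downsampled) samples
radius = max(1, int(window_frac * len(downsampled[0])))

# Example: constrained DTW distance between the first two customers
d = constrained_dtw(downsampled[0], downsampled[1],
                    global_constraint="sakoe_chiba",
                    sakoe_chiba_radius=radius)
print(d)

The same pairwise loop from the question can then be run over the downsampled series with this constrained distance instead of the unconstrained one.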

AlexK

You can try parallelizing the distance calculation. Also, the fastdtw library has a faster (approximate) implementation.

from fastdtw import fastdtw
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import AgglomerativeClustering

def my_dist(x, y):
    return np.abs(x - y)

# Each worker fills one row of the upper triangle of the distance matrix
def calculate_distance(i, list_of_lists):
    distances = np.zeros(len(list_of_lists))
    x = list_of_lists[i]
    for j in range(i+1, len(list_of_lists)):
        y = list_of_lists[j]
        distance, _ = fastdtw(x, y, dist=my_dist)
        distances[j] = distance
    return distances

# Compute the pairwise DTW distances using parallel processing
n_jobs = -1  # Use all available CPU cores
distances = np.zeros((len(list_of_lists), len(list_of_lists)))
results = Parallel(n_jobs=n_jobs)(delayed(calculate_distance)(i, list_of_lists) for i in range(len(list_of_lists)))

# Assemble the full symmetric distance matrix from the per-row results
for i, row in enumerate(results):
    distances[i, i+1:] = row[i+1:]
    distances[i+1:, i] = row[i+1:]

# Perform hierarchical clustering on the distance matrix
clustering = AgglomerativeClustering(n_clusters=cluster_count, metric='precomputed', linkage='average')
labels = clustering.fit_predict(distances)
print(labels)
dlPFC