I am trying to calculate a DTW distance matrix over 150,000 time series, each with between 13 and 24 observations. The resulting condensed distance matrix will therefore hold approximately (150,000 x 150,000)/2 = 11,250,000,000 entries.
I am running this on a big-data cluster with 200 GB of memory, but I am getting a MemoryError.
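A back-of-the-envelope estimate of the memory involved (assuming 64-bit CPython and float64 distances; the per-object sizes are approximations):

import sys

n = 150_000
n_pairs = n * (n - 1) // 2                 # ~11.25e9 pairwise distances

bytes_numpy = n_pairs * 8                  # one contiguous float64 array
bytes_list = n_pairs * (24 + 8)            # ~24 B per Python float object + 8 B list pointer

print(f"numpy float64 array: {bytes_numpy / 1e9:.0f} GB")    # ~90 GB
print(f"Python list of floats: {bytes_list / 1e9:.0f} GB")   # ~360 GB

So even the compact numpy representation is close to half of the cluster memory, and a plain Python list of floats would be several times larger than 200 GB.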
I am using the dtaidistance library. I first tried the distance_matrix_fast function, passing the entire list of time series at once, but I got a similar memory error thrown from inside the package as soon as I ran it. I also tried the block option of that function, but it does not seem able to handle all of the time series at once either.
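Roughly what I tried (reconstructed from memory, so the exact block tuple I used may have differed):

from dtaidistance import dtw

# Passing the whole list of series at once -- fails immediately with a MemoryError
distances = dtw.distance_matrix_fast(Series_scaled)

# Attempt with the block argument, computing only a sub-block of the matrix
# (block format per the dtaidistance docs: ((row_start, row_end), (col_start, col_end)))
distances = dtw.distance_matrix_fast(
    Series_scaled,
    block=((0, 10_000), (0, len(Series_scaled))),
)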
So I decided to loop over every pair of time series, calculate the distance, and append it to a list. However, after running for a long while, I get the same memory error:
File "/root/anaconda2/test/final_clustering_2.py", line 93, in distance_matrix_scaled.append(dtw.distance_fast(Series_scaled[i], Series_scaled[j])) MemoryError
This is my code:
import numpy as np
from dtaidistance import dtw

distance_matrix_scaled = []
m = len(Series_scaled)
# m = 100000
for i in range(0, m - 1):
    for j in range(i + 1, m):
        # DTW distance for every pair (i, j) with i < j, appended to a flat list
        distance_matrix_scaled.append(dtw.distance_fast(Series_scaled[i], Series_scaled[j]))

# save it to the disk
np.save('distance_entire', distance_matrix_scaled)
Could you please help me understand why I am getting this memory error? Is it a limit on Python lists, or is my cluster size the cause? Is there a clever way or format in numpy I could use to navigate this problem?
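For example, would a pre-allocated np.memmap on disk (instead of appending to a Python list) be the right direction? A rough sketch of what I have in mind (the condensed indexing and float32 choice are my own ideas, not tested at this scale):

import numpy as np
from dtaidistance import dtw

m = len(Series_scaled)
n_pairs = m * (m - 1) // 2

# float32 on disk: roughly half the size of float64, and it never has to fit in RAM
distances = np.memmap('distance_entire.dat', dtype=np.float32,
                      mode='w+', shape=(n_pairs,))

k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        # store the pair (i, j) at the next position of the condensed (upper-triangle) layout
        distances[k] = dtw.distance_fast(Series_scaled[i], Series_scaled[j])
        k += 1

distances.flush()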