
I am trying to calculate a DTW distance matrix over 150,000 time series, each having between 13 and 24 observations - that is, the resulting distance matrix will be a list of approximately (150,000 x 150,000)/2 = 11,250,000,000 entries.

I am running this on a big data cluster with 200 GB of memory, but I am getting a memory error.

I am using the dtaidistance library and its distance_matrix_fast function, which lets me pass the entire list of time series at once, but I was getting a similar memory error coming from inside the package - the error was thrown straight away as soon as I ran it. I also tried the block option in the package, but it seems unable to take all the time series at once to start with.

So I decided to go through a loop, calculate the distance between every pair of time series, and append it to a list. However, I get the same memory error again, as follows, after running for a long while:

File "/root/anaconda2/test/final_clustering_2.py", line 93, in distance_matrix_scaled.append(dtw.distance_fast(Series_scaled[i], Series_scaled[j])) MemoryError

This is my code below:

import numpy as np
from dtaidistance import dtw

# Series_scaled: the list of 150,000 scaled time series (defined earlier)
distance_matrix_scaled = []

m = len(Series_scaled)
#m = 100000
for i in range(0, m - 1):
    for j in range(i + 1, m):
        distance_matrix_scaled.append(dtw.distance_fast(Series_scaled[i], Series_scaled[j]))

# save it to the disk
np.save('distance_entire', distance_matrix_scaled)

Could you please help me understand why I am getting this memory error? Is it the Python list limit or my cluster size that is causing this? Is there a clever way or format in numpy I could use to get around this problem?

RomRom
  • You can also look at the triarray package: https://pypi.org/project/triarray/ – anishtain4 Oct 23 '18 at 23:05
  • If you're open to using `dask`, you could use its distributed arrays (http://docs.dask.org/en/latest/array.html). These are fundamentally a collection of distributed `numpy` arrays that you can index just like you normally would. If you do this, you can simply do some clever broadcasting to calculate your distance matrix. – PMende Oct 23 '18 at 23:31
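A minimal sketch of what that dask suggestion could look like, assuming the series are padded to a common length and using plain Euclidean distance as a stand-in (DTW does not broadcast like this; the array shapes and chunk sizes below are illustrative only):

import dask.array as da
import numpy as np

# Hypothetical illustration: 150,000 series padded to length 24.
X = da.from_array(np.random.rand(150_000, 24), chunks=(2_000, 24))

diff = X[:, None, :] - X[None, :, :]      # lazy (150000, 150000, 24) array
D = da.sqrt((diff ** 2).sum(axis=-1))     # lazy (150000, 150000) distance matrix

# Materialise one row-block at a time instead of the whole matrix.
first_block = D[:2_000, :].compute()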

2 Answers


Your double for loop has 4,999,950,000 iterations for m = 100000, and you're appending to a list that many times. Does the memory error still seem weird?

If the distance is a scalar then you could indeed save memory by pre-allocating an array that large (and hoping for the best, memory-wise):

import numpy as np

m = 100000
distances = np.empty(m*(m-1)//2)  # size: 4999950000 (note the integer division)
k = 0
for i in range(0, m - 1):
    for j in range(i + 1, m):
        distances[k] = dtw.distance_fast(Series_scaled[i], Series_scaled[j])
        k += 1

Since numpy arrays occupy a contiguous block of memory they are way more efficient at large scale than native lists. There is literally negligible overhead from the array itself so you mostly get the size of your actual data.
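To make that concrete, here is a quick comparison sketch (the numbers are approximate and specific to 64-bit CPython):

import sys
import numpy as np

n = 1_000_000
as_list = [float(i) for i in range(n)]      # one million separate Python float objects
as_array = np.arange(n, dtype=np.float64)   # one contiguous buffer of doubles

# The list object stores n pointers; each float object adds roughly 24 more bytes.
list_bytes = sys.getsizeof(as_list) + n * sys.getsizeof(0.0)
print(list_bytes)          # roughly 32 MB
print(as_array.nbytes)     # exactly 8 MB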

If this huge array doesn't fit in your memory, you're in trouble. You'd have to cut up your data and work in smaller chunks. Or maybe use some kind of memory mapping.
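For the memory-mapping route, a sketch along those lines (assuming the same `Series_scaled` and `dtw.distance_fast` as in the question; the file name is just illustrative) would be a disk-backed `np.memmap` that you fill in place instead of an in-RAM array:

import numpy as np
from dtaidistance import dtw

m = len(Series_scaled)
n_pairs = m * (m - 1) // 2

# Disk-backed array: the OS pages pieces in and out as needed
# (~45 GB file for 150,000 series at float32).
distances = np.memmap('distance_entire.dat', dtype=np.float32,
                      mode='w+', shape=(n_pairs,))
k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        distances[k] = dtw.distance_fast(Series_scaled[i], Series_scaled[j])
        k += 1
distances.flush()   # make sure everything hits the disk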

A minor note, however: the array we're populating (assuming 64-bit doubles) occupies roughly 37 GB of RAM. That's... a lot. And even if you can fit that in your memory, you'll have to wait through 5 billion iterations of a Python (double) loop. This will take... a lot of time. Don't hold your breath.

  • Thanks Andras - before writing the loop, I used the dtaidistance matrix function directly, which would return a distance matrix - the memory problem appeared to be coming from inside the package - so I replaced that with the loop and appended the results to a list. I will try your approach and hopefully it will work. – RomRom Oct 23 '18 at 23:06
  • @RomRom I see. Updated my answer with some numbers... 37 GB of memory even with a contiguous block of memory (unless I'm mistaken), and 5e9 native Python iterations. You should time the loop with a trivial body (just summing zeros) to see how long the loop itself takes, then multiply that by how long a single distance calculation usually takes. I'm afraid you'll get completely crazy runtimes. – Andras Deak -- Слава Україні Oct 23 '18 at 23:19
  • Yes, I did time the loop before deciding to do it: it takes about 30 hours to process the entire data, and 4 seconds for 1,000 cases - it is time consuming, but unfortunately there's no other way around it. I also calculated the space the entire distance matrix would take, and for the entire 150,000 I arrived at around 99 GB - so I actually moved my work to a cluster with 6 TiB... hopefully it will run. Thanks for your help and comments. – RomRom Oct 23 '18 at 23:32

If you are computing something like Euclidean distance, look at the memory cost of your computation: you will generate an intermediate temporary array of size 150000*149999/2*(13~24), where

  • The 150000*149999/2 is the number of unordered pairs among the 150,000 time series (excluding self-self pairs)

  • The 13~24 is the length of the element-wise difference between two time series vectors, which is normed later and reduced to one number per pair.

Each number is typically a float or double, which is 4 or 8 bytes. Therefore, the computation would need roughly 1 TB to 4 TB of memory, which obviously blows past 200 GB.
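For concreteness, here is a back-of-the-envelope sketch of that estimate (the final doubling is an assumption that a second temporary, e.g. the squared differences, briefly coexists with the difference array):

n_series = 150_000
n_pairs = n_series * (n_series - 1) // 2          # ~1.12e10 unordered pairs
for length in (13, 24):                           # shortest / longest series
    for itemsize in (4, 8):                       # float32 / float64
        one_temp = n_pairs * length * itemsize / 1e12
        print(f"length={length}, {itemsize}-byte floats: "
              f"~{one_temp:.1f} TB per temporary, ~{2 * one_temp:.1f} TB peak")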

Several tricks are available to reduce the memory cost, besides manually dividing the work into smaller tasks:

  • If you insist on numpy, definitely choose a smaller dtype such as float32. If your numbers are small, you might even consider int16 or int8. Do not use float16, as it has no computation support on CPU (it is extremely slow).

  • If that is not enough, you might consider numba, which lets you compile the Python loop to highly efficient CPU code and run it over all cores; this should be the optimal solution on CPU, since it does not need to generate the temporary array (a minimal sketch follows this list).

  • SciPy also has scipy.spatial.distance.pdist. I'm not exactly sure how it performs in terms of memory, but you can give it a try.
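As an illustration of the numba idea above, here is a minimal sketch that uses plain Euclidean distance on equal-length series as a stand-in for DTW (the function name and layout are hypothetical, not part of dtaidistance). It fills a condensed, pdist-style distance vector in parallel without allocating any large temporary:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_condensed(X):
    # X: (n, L) array of equal-length series; returns the condensed
    # distance vector of length n*(n-1)//2, one float32 per pair.
    n = X.shape[0]
    out = np.empty(n * (n - 1) // 2, dtype=np.float32)
    for i in prange(n - 1):
        # offset of row i in the condensed layout, so that out[base + j]
        # lands on the entry for the pair (i, j)
        base = i * n - (i * (i + 1)) // 2 - (i + 1)
        for j in range(i + 1, n):
            d = 0.0
            for t in range(X.shape[1]):
                diff = X[i, t] - X[j, t]
                d += diff * diff
            out[base + j] = np.sqrt(d)
    return out

Stored as float32, the condensed result for 150,000 series is about 45 GB (11,249,925,000 values × 4 bytes), which at least fits in 200 GB of memory.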

ZisIsNotZis
  • @ZisIsNotZis thanks for your comments - I don't think your estimated memory of 1 TB or 4 TB is correct - based on my estimation it will be roughly 90 GB. Unfortunately I cannot use int, as I am calculating a particular distance, not Euclidean. That leaves only your option 2, so I'll be looking into numba to see how it works. – RomRom Oct 24 '18 at 05:00
  • @RomRom Oh, I thought it was something like Euclidean, sorry for that. Anyway, your figure of 90 GB should be the memory usage of the final result (of double type), while I believe that if it is done in a broadcasted way, the temporary memory cost may be way bigger than 90 GB. – ZisIsNotZis Oct 24 '18 at 05:24