
I read two datasets from file into numpy arrays like this:

def read_data(filename):
    data = np.empty(shape=[0, 65], dtype=int)
    with open(filename) as f:
        for line in f:
            data = np.vstack((data, np.array(list(map(int, line.split(','))), dtype=int)))
    return data
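
As an aside, the row-by-row `np.vstack` copies the entire accumulated array on every iteration, so the read alone is quadratic in the number of lines. NumPy can parse the whole comma-separated file in a single call; a sketch using `np.loadtxt`, with an in-memory string standing in for the real file:

```python
import io

import numpy as np

# Stand-in for the real file: 4 lines of 65 comma-separated integers each.
csv_text = "\n".join(",".join(str((i + j) % 10) for j in range(65)) for i in range(4))

# np.loadtxt parses everything in one call instead of growing an array per line;
# for a file on disk, pass the filename instead of the StringIO object.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)
print(data.shape)  # (4, 65)
```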

I use numpy to calculate the euclidean distance between two lists:

def euclidean_distance(x, z):
    return np.linalg.norm(x - z)

After this, I calculate the euclidean distances like this:

for data in testing_data:
    for data2 in training_data:
        dist = euclidean_distance(data, data2)
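
For readers with a similar problem: the same distances can be computed without the Python-level double loop by broadcasting. Subtracting arrays with mismatched leading axes produces a (n_test, n_train, 65) difference array, and a norm over the last axis yields the full distance matrix at once. A sketch with small random stand-ins for the two datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
testing_data = rng.integers(0, 17, size=(5, 65))    # toy stand-in, 5 test rows
training_data = rng.integers(0, 17, size=(3, 65))   # toy stand-in, 3 training rows

# Broadcast to shape (5, 3, 65), then reduce over the feature axis.
diff = testing_data[:, None, :] - training_data[None, :, :]
dist = np.linalg.norm(diff, axis=2)                 # dist[i, j] = ||test_i - train_j||
print(dist.shape)  # (5, 3)
```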

My problem is that this code runs very slowly; it takes about 10 minutes to finish. How can I improve this? What am I missing?
I have to use the distances in another algorithm, so speed is very important.

kmario23
Fogarasi Norbert
  • Use cdist - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html? – Divakar May 06 '19 at 13:34
  • Of course, the complexity here is O(N*M) where `N` and `M` are the sizes of `testing_data` and `training_data` respectively. So it depends on how big both of these datasets are. – kmario23 May 06 '19 at 13:53
  • The testing dataset consists of 3823 samples and the training dataset of 1797. So 6,869,931 distances have to be calculated, which I don't think is so much that it should take 10 minutes. – Fogarasi Norbert May 06 '19 at 14:00
  • @FogarasiNorbert Agreed, that's not much! One optimization could be to get rid of the manual creation of `list` and just use `np.fromiter(map(int, line.split(',')), dtype=int)`, although I think it might not give that much improvement. Another thing might be to get rid of the function `euclidean_distance()` and inline the code directly since it's just one line. It might give a little boost since we're avoiding 6.8M function calls. – kmario23 May 06 '19 at 14:32
  • @Divakar I have used `cdist` as well manual implementation using `numpy.linalg.norm` and didn't observe much difference in terms of speed. – kmario23 May 06 '19 at 14:35
  • Something strange is happening here. I tried `cdist` and it runs in 2 seconds, yet if I improve my code as you said @kmario23 it takes the same time as before. I don't want to use `cdist` because that creates a matrix. How is this possible? – Fogarasi Norbert May 06 '19 at 14:40
  • In which way do you want the output, if not a 2d numpy array? – max9111 May 07 '19 at 08:05
  • This distance calculation is part of an algorithm; the way I want to use it is to pass two lists of length 64 as parameters to my euclidean_distance function. – Fogarasi Norbert May 07 '19 at 08:38
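
Divakar's `cdist` suggestion computes the same matrix in compiled code, and individual pairs can still be read off by plain indexing, which addresses the "creates a matrix" concern. A sketch with toy stand-ins for the real data, assuming SciPy is available:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
testing_data = rng.random((4, 64))   # toy stand-in for the 64-feature test rows
training_data = rng.random((6, 64))  # toy stand-in for the training rows

# Euclidean metric by default; result has shape (4, 6).
dist = cdist(testing_data, training_data)

# The per-pair value the question's loop computed is now an index lookup:
d = dist[0, 0]
```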

1 Answer


You could use sklearn.metrics.pairwise_distances, which allows you to distribute the work across all of your cores. Parallel construction of a distance matrix discusses the same topic and provides a good discussion of the differences between pdist, cdist, and pairwise_distances.

If I understand your example correctly, you want the distance between each sample in the training set and each sample in the testing set. To do that you could use:

dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
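
For completeness, a runnable sketch with toy stand-ins for the real datasets, assuming scikit-learn is installed; `dist[i, j]` is then the distance between training sample `i` and testing sample `j`:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(2)
training_data = rng.random((6, 65))  # toy stand-in for the real training set
testing_data = rng.random((4, 65))   # toy stand-in for the real testing set

# Euclidean metric by default; n_jobs=-1 spreads the work over all cores.
dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
print(dist.shape)  # (6, 4)
```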
Grr