
I read two datasets from file into numpy arrays like this:

def read_data(filename):
    data = np.empty(shape=[0, 65], dtype=int)
    with open(filename) as f:
        for line in f:
            data = np.vstack((data, np.array(list(map(int, line.split(','))), dtype=int)))
    return data
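
As an aside, the row-by-row `np.vstack` copies the entire accumulated array on every iteration, so the read alone is quadratic in the number of lines. NumPy can parse the whole comma-separated file in a single call; a sketch using `np.loadtxt`, with an in-memory string standing in for the real file:

```python
import io

import numpy as np

# Stand-in for the real file: 4 lines of 65 comma-separated integers each.
csv_text = "\n".join(",".join(str((i + j) % 10) for j in range(65)) for i in range(4))

# np.loadtxt parses everything in one call instead of growing an array per line;
# for a file on disk, pass the filename instead of the StringIO object.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)
print(data.shape)  # (4, 65)
```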

I use numpy to calculate the euclidean distance between two lists:

def euclidean_distance(x, z):
    return np.linalg.norm(x - z)

After this, I calculate the euclidean distances like this:

for data in testing_data:
    for data2 in training_data:
        dist = euclidean_distance(data, data2)
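
For readers with a similar problem: the same distances can be computed without the Python-level double loop by broadcasting. Subtracting arrays with mismatched leading axes produces a (n_test, n_train, 65) difference array, and a norm over the last axis yields the full distance matrix at once. A sketch with small random stand-ins for the two datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
testing_data = rng.integers(0, 17, size=(5, 65))    # toy stand-in, 5 test rows
training_data = rng.integers(0, 17, size=(3, 65))   # toy stand-in, 3 training rows

# Broadcast to shape (5, 3, 65), then reduce over the feature axis.
diff = testing_data[:, None, :] - training_data[None, :, :]
dist = np.linalg.norm(diff, axis=2)                 # dist[i, j] = ||test_i - train_j||
print(dist.shape)  # (5, 3)
```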

My problem is that this code runs very slowly; it takes about 10 minutes to finish. How can I improve this? What am I missing?
I have to use the distances in another algorithm, so speed is very important.

kmario23
Fogarasi Norbert
  • Use cdist - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html? – Divakar May 06 '19 at 13:34
  • Of course, the complexity here is O(N*M) where `N` and `M` are the sizes of `testing_data` and `training_data` respectively. So it depends on how big both of these datasets are. – kmario23 May 06 '19 at 13:53
  • The testing dataset consists of 3823 samples and the training dataset of 1797. So 6,869,931 distances have to be calculated, which I don't think is so much that it should take 10 minutes. – Fogarasi Norbert May 06 '19 at 14:00
  • @FogarasiNorbert Agreed, that's not much! One optimization could be to get rid of the manual creation of `list` and just use `np.fromiter(map(int, line.split(',')), dtype=int)`, although I think it might not give that much improvement. Another thing might be to get rid of the function `euclidean_distance()` and inline the code directly since it's just one line. It might give a little boost since we're avoiding 6.8M function calls. – kmario23 May 06 '19 at 14:32
  • @Divakar I have used `cdist` as well manual implementation using `numpy.linalg.norm` and didn't observe much difference in terms of speed. – kmario23 May 06 '19 at 14:35
  • Something strange is happening here. I tried `cdist` and it runs in 2 seconds, yet if I improve my code as you said @kmario23 it takes the same time as before. I don't want to use `cdist` because that creates a matrix. How is this possible? – Fogarasi Norbert May 06 '19 at 14:40
  • In which way do you want the output, if not a 2d numpy array? – max9111 May 07 '19 at 08:05
  • This distance calculation is part of an algorithm; the way I want to use it is to pass two lists of length 64 as parameters to my euclidean_distance function. – Fogarasi Norbert May 07 '19 at 08:38
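
Divakar's `cdist` suggestion computes the same matrix in compiled code, and individual pairs can still be read off by plain indexing, which addresses the "creates a matrix" concern. A sketch with toy stand-ins for the real data, assuming SciPy is available:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
testing_data = rng.random((4, 64))   # toy stand-in for the 64-feature test rows
training_data = rng.random((6, 64))  # toy stand-in for the training rows

# Euclidean metric by default; result has shape (4, 6).
dist = cdist(testing_data, training_data)

# The per-pair value the question's loop computed is now an index lookup:
d = dist[0, 0]
```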

1 Answer


You could use sklearn.metrics.pairwise_distances, which allows you to distribute the work across all of your cores. Parallel construction of a distance matrix discusses the same topic and provides a good discussion of the differences between pdist, cdist, and pairwise_distances.

If I understand your example correctly, you want the distance between each sample in the training set and each sample in the testing set. To do that you could use:

dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
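
For completeness, a runnable sketch with toy stand-ins for the real datasets, assuming scikit-learn is installed; `dist[i, j]` is then the distance between training sample `i` and testing sample `j`:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(2)
training_data = rng.random((6, 65))  # toy stand-in for the real training set
testing_data = rng.random((4, 65))   # toy stand-in for the real testing set

# Euclidean metric by default; n_jobs=-1 spreads the work over all cores.
dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
print(dist.shape)  # (6, 4)
```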
Grr