I have implemented the k-NN algorithm, and these are my functions for calculating the Euclidean distance.

import math

# Euclidean distance between one training point and one test point (2-D only).
def euc_dist(self, train, test):
    return math.sqrt(((train[0] - test[0]) ** 2) + ((test[1] - train[1]) ** 2))

# Distances from every test point to every training point.
def euc_distance(self, test):
    eu_dist = []
    for i in range(len(test)):
        distance = [self.euc_dist(self.X_train[j], test[i]) for j in range(len(self.X_train))]
        eu_dist.append(distance)

    return eu_dist

Is there a more efficient way to perform the distance calculation?

Bill Bell
nirvair

2 Answers

(1) Loops in pure Python are slow because every iteration goes through the interpreter. Learn to use array computations, e.g. with NumPy:

import numpy as np

x = np.array(...)
y = np.array(...)
distances = np.sqrt(np.sum((x-y)**2)) 

Batching the computations allows for efficient vectorized or even parallel implementations.
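As a rough illustration of the batching idea, here is a minimal sketch that computes all test-to-train distances at once with NumPy broadcasting. The function name pairwise_euclidean and the sample points are made up for this example and are not taken from the question's class.

```python
import numpy as np

def pairwise_euclidean(X_train, X_test):
    """Return D with shape (n_test, n_train), where D[i, j] is the
    Euclidean distance between X_test[i] and X_train[j]."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    # Broadcasting: (n_test, 1, d) - (1, n_train, d) -> (n_test, n_train, d)
    diff = X_test[:, None, :] - X_train[None, :, :]
    # Sum the squared differences over the coordinate axis, then take the root.
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Hypothetical sample data: two training points and two test points.
X_train = [[0.0, 0.0], [3.0, 4.0]]
X_test = [[0.0, 0.0], [6.0, 8.0]]
D = pairwise_euclidean(X_train, X_test)
print(D)
```

Note that the intermediate array has n_test * n_train * d elements, so for large inputs it may be worth processing the test points in chunks.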

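(2) If you only compare distances with one another (e.g. to rank neighbours), you don't need the absolute distance values, so you can omit the square root operation, which is comparatively expensive. Omission is possible because sqrt is monotonically increasing for non-negative inputs, so squared distances have the same ordering as the distances themselves. Note that this argument covers comparisons only: averaging or normalizing squared distances does not, in general, give the same result as doing so with the true distances.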

squared_distances = np.sum((x-y)**2)
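To show how squared distances can stand in for true distances when only the ordering matters, here is a small sketch that picks the k nearest training points for a single query. The helper k_nearest_indices and the data are assumptions for illustration only.

```python
import numpy as np

def k_nearest_indices(X_train, x, k):
    """Return the indices of the k training points closest to x,
    ranked by squared Euclidean distance (no square root needed)."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    # One squared distance per training row.
    sq = np.sum((X_train - x) ** 2, axis=1)
    # Sorting squared distances gives the same order as sorting distances.
    return np.argsort(sq, kind="stable")[:k]

# Hypothetical sample data.
X_train = [[0, 0], [5, 5], [1, 1], [2, 2]]
idx = k_nearest_indices(X_train, [0.9, 0.9], k=2)
print(idx)
```

For large training sets, np.argpartition may be a cheaper alternative to a full sort, since only the k smallest values are needed.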

(3) There may be distance definitions other than Euclidean that are meaningful for your specific problem. You may be able to find a definition that is simpler and faster to compute, e.g. a simple difference or an absolute error.

error = x-y
absolute_error = np.abs(x-y)
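One common way to turn the absolute error into a distance between points is to sum it over the coordinates, which gives the Manhattan (L1) distance. The sketch below assumes that metric is acceptable for the problem; the function name and data are hypothetical.

```python
import numpy as np

def manhattan_distances(X_train, x):
    """Return the Manhattan (L1) distance from x to every row of X_train."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    # Sum of absolute coordinate differences, one value per training row.
    return np.sum(np.abs(X_train - x), axis=1)

# Hypothetical sample data.
d = manhattan_distances([[0, 0], [3, 4], [-1, 2]], [1, 1])
print(d)
```

Keep in mind that a different metric can change which neighbours are considered nearest, so the classifier's results may differ from the Euclidean version.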

(4) In all cases, try and measure (profile). Do not rely on intuition when you deal with runtime performance optimization.
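A minimal way to measure, assuming small random 2-D data, is the standard timeit module. The two functions below are illustrative stand-ins for a loop-based and a vectorized implementation; the actual timings depend on the machine and the data size, so none are quoted here.

```python
import math
import timeit

import numpy as np

# Hypothetical random data: 200 training points and 50 test points in 2-D.
rng = np.random.default_rng(0)
train = rng.random((200, 2))
test = rng.random((50, 2))

def loop_version():
    # Nested Python loops, one math.sqrt call per pair of points.
    return [[math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) for a in train]
            for b in test]

def numpy_version():
    # All pairs at once via broadcasting.
    diff = test[:, None, :] - train[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Time each version over a few repetitions.
loop_time = timeit.timeit(loop_version, number=5)
numpy_time = timeit.timeit(numpy_version, number=5)
print(f"loop:  {loop_time:.4f} s")
print(f"numpy: {numpy_time:.4f} s")
```

Checking that both versions return the same values before comparing their timings helps avoid benchmarking a faster but incorrect implementation.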

P.S. Code snippets above do not map to your code exactly (on purpose). It is up to you to learn how to adapt them. Hint: 2D arrays ;)

Ivan Aksamentov - Drop
If the distances are needed only for comparisons, you can use squared distances instead: just remove math.sqrt, which is a slow operation.

Possible optimization: if Python implements (train[0] - test[0]) ** 2 as a general power operation, it is worth replacing it with a simple multiplication:

def squared_euc_dist(self, train, test):
    x = train[0] - test[0]
    y = train[1] - test[1]
    return x * x + y * y
MBo
  • Yes, squaring by multiplication is roughly twice as fast as using `**`. And if the OP needs distance instead of squared distance then `math.hypot` is worth looking at. OTOH, they should probably be using Numpy. – PM 2Ring May 11 '17 at 03:37
  • In this case, neither squaring, nor sqrt are nearly as important as looping and memory access overhead in the interpreter. – Ivan Aksamentov - Drop May 11 '17 at 03:51
  • @Drop Certainly! Which is why I mentioned Numpy. ;) – PM 2Ring May 11 '17 at 11:58
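For reference, the math.hypot suggestion from the comments could look roughly like this for a single pair of 2-D points. The function name euc_dist_hypot is made up for this sketch, and it is written as a plain function rather than a method.

```python
import math

def euc_dist_hypot(train, test):
    """Euclidean distance between two 2-D points using math.hypot."""
    # math.hypot(dx, dy) returns sqrt(dx*dx + dy*dy).
    return math.hypot(train[0] - test[0], train[1] - test[1])

d = euc_dist_hypot((3.0, 4.0), (0.0, 0.0))
print(d)
```

This only replaces the per-pair arithmetic; as the comments point out, the Python-level looping is likely to remain the dominant cost.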