
I am fitting a k-nearest neighbors classifier using scikit-learn and noticed that the fitting is faster, often by an order of magnitude or more, when using the cosine distance between two vectors compared to when using the Euclidean distance. Note that both of these are sklearn built-ins; I am not using a custom implementation of either metric.

What is the reason behind such a big discrepancy? I know scikit-learn uses either a ball tree or a KD tree to compute the neighbor graph, but I'm not sure why the form of the metric would affect the run time of the algorithm.

To quantify the effect, I performed a simulation experiment in which I fit a KNN to random data using either the Euclidean or the cosine metric, and recorded the fitting time in each case. The code and the resulting average run times are shown below:

import numpy as np
import time
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

res = []
n_trials = 10
for trial_id in range(n_trials):
    for n_pts in [100, 300, 1000, 3000, 10000, 30000, 100000]:
        for metric in ['cosine', 'euclidean']:
            knn = KNeighborsClassifier(n_neighbors=20, metric=metric)
            # random data: n_pts points in 100 dimensions, with random binary labels
            X = np.random.randn(n_pts, 100)
            labs = np.random.choice(2, n_pts)
            # time only the call to fit()
            starttime = time.time()
            knn.fit(X, labs)
            elapsed = time.time() - starttime
            res.append([elapsed, n_pts, metric, trial_id])

res = pd.DataFrame(res, columns=['time', 'size', 'metric', 'trial'])
# average the fit times over trials for each (size, metric) combination
av_times = pd.pivot_table(res, index='size', columns='metric', values='time')
print(av_times)

[Image: table of average fit times by size and metric; the Euclidean fits are roughly an order of magnitude slower than the cosine fits at the larger sizes.]

Edit: These results are from a MacBook with version 0.21.3 of sklearn. I also reproduced the effect on an Ubuntu desktop machine with sklearn version 0.23.2.

Simon Segert
  • I've just run your code multiple times on [replit](https://replit.com/@aminnejad/KNN-Distance-Comparison) and I can't see any significant difference using `sklearn==0.24.2`. If you are using the same version, it may be something to do with your local machine – amin_nejad May 25 '21 at 17:12
  • @amin_nejad, very interesting. I also tried a different machine with version 23.2, and got an effect similar to the one in my question. Looking in the changelog, there were a few changes to the KNN class since version 23.2, but nothing that seems obviously relevant: https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-0 – Simon Segert May 25 '21 at 19:22
  • (I wasted minutes chasing "S-Euclidean", but there seems to be [sqeuclidean in scipy.spatial.distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.sqeuclidean.html) - no sparse matrices, and I'm confused whether it can be used in `KNeighborsClassifier()`. Another idea is to fix the algorithm for timing analysis purposes, starting with *brute*.) – greybeard May 26 '21 at 07:59
  • Even though `euclidean` and `cosine` aren't explicitly mentioned, it must be something to do with the [changes](https://scikit-learn.org/stable/whats_new/v0.24.html#sklearn-neighbors) listed in the changelog under `sklearn.neighbors` – amin_nejad May 26 '21 at 11:26

4 Answers


Based on the comments, I tried running the code with algorithm='brute' in the KNN, and the Euclidean times sped up to match the cosine times. But trying algorithm='kd_tree' and algorithm='ball_tree' both throw errors, since apparently these algorithms do not accept the cosine distance. So it looks like when the classifier is fit in algorithm='auto' mode, it defaults to the brute-force algorithm for the cosine metric, whereas for the Euclidean distance it uses one of the tree-based algorithms. Looking at the changelog, the difference between versions 0.23.2 and 0.24.2 presumably comes down to the following item:

neighbors.NeighborsBase benefits of an improved algorithm = 'auto' heuristic. In addition to the previous set of rules, now, when the number of features exceeds 15, brute is selected, assuming the data intrinsic dimensionality is too high for tree-based methods.

So it seems like the difference between the two did not have to do with the metric itself, but rather with the performance of a tree-based vs. a brute-force search in high dimensions. For sufficiently high dimensions, tree-based searches may fail to outperform linear searches, so the run time ends up slower overall due to the additional overhead required to construct the data structure. In this case, the implementation was forced to use the faster brute-force search in the cosine case because the tree-based algorithms do not work with the cosine distance, but it (suboptimally) picked a tree-based algorithm in the Euclidean case. It looks like this behavior has been noticed and corrected in the latest version.
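
A quick way to see this in action (a minimal sketch, not from the original answer; note that _fit_method is a private attribute and may change between versions) is to inspect which algorithm the 'auto' heuristic actually resolves to, and to confirm that the tree-based algorithms reject the cosine metric:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.randn(10000, 100)   # 100 features, well above the ">15 features" threshold from the changelog
y = np.random.choice(2, 10000)

# Which algorithm does 'auto' pick for each metric?
# (_fit_method is internal, so treat this as a diagnostic only.)
for metric in ['euclidean', 'cosine']:
    knn = KNeighborsClassifier(n_neighbors=20, metric=metric, algorithm='auto')
    knn.fit(X, y)
    print(metric, '->', knn._fit_method)

# Forcing a tree-based algorithm with the cosine metric raises an error,
# since KDTree and BallTree do not support the cosine distance.
try:
    KNeighborsClassifier(n_neighbors=20, metric='cosine', algorithm='kd_tree').fit(X, y)
except ValueError as err:
    print('kd_tree + cosine:', err)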

desertnaut
Simon Segert
  • If you quote from the documentation (or from anywhere else available online), please also include the relevant link (edited to add). – desertnaut May 28 '21 at 09:03

As pointed out by @igrinis, this is no longer an issue in the latest stable version of scikit-learn (0.24.1). Regardless, I think what I'm about to write could be a contributing factor.

According to the documentation:

  1. metric=euclidean measures distances using sqrt(sum((x - y)^2))
  2. metric=cosine measures distances using 1 - (x · y) / (||x|| * ||y||), i.e. one minus the cosine similarity

As you can see, there are no square roots in metric=cosine, which could be the reason why the fitting time is much longer with the first option.
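
For reference, both formulas can be checked numerically against sklearn's pairwise_distances (a small sketch, not part of the original answer):

import numpy as np
from sklearn.metrics import pairwise_distances

x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[4.0, 5.0, 6.0]])

# euclidean: sqrt(sum((x - y)^2))
print(pairwise_distances(x, y, metric='euclidean'))           # [[5.19615242]]
print(np.sqrt(np.sum((x - y) ** 2)))                          # 5.196152422706632

# cosine: 1 - (x . y) / (||x|| * ||y||)
print(pairwise_distances(x, y, metric='cosine'))              # [[0.02536815]]
print(1 - x @ y.T / (np.linalg.norm(x) * np.linalg.norm(y)))  # [[0.02536815]]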

If you want to speed things up even further, you could consider a linear kernel, which may yield the same results as cosine, but will fit even faster because the denominator is not involved (meaning there are no divisions).
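
As a caveat to the linear-kernel suggestion (again a small sketch, not from the original answer): linear_kernel only reproduces cosine_similarity once the rows have been L2-normalized, because that is what makes the denominator equal to 1:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

X = np.random.randn(5, 100)

# On raw data the two generally differ...
print(np.allclose(cosine_similarity(X), linear_kernel(X)))    # usually False

# ...but after L2-normalizing each row they coincide,
# because ||x|| * ||y|| is then 1 for every pair of rows.
Xn = normalize(X)  # L2 row normalization by default
print(np.allclose(cosine_similarity(Xn), linear_kernel(Xn)))  # True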

desertnaut
Arturo Sbr
  • "this formula" the formula is missing. Besides this, the computation of a square root should be pretty fast on modern processors (eg. [a square root can be computed every 6 cycles on Skylake-based processors](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&text=sqrt&techs=SSE,SSE2&expand=5365,5386,5386) and 4 every 12 cycles with AVX SIMD instructions). This hardly justifies the x55 slowdown, and especially not the fact that the slowdown increases with the input size... I guess the algorithm is not the same for both metrics (but maybe due to math tricks based on that). – Jérôme Richard May 23 '21 at 17:41
  • My bad! I updated the answer. The link was inside the code snippet. And yes, it could be that the algorithm is different altogether. – Arturo Sbr May 23 '21 at 18:44
  • Interesting hypothesis, but I don't think this is the reason. I just re-ran my experiment, but also including the "manhattan" distance option, and the run times there were essentially the same as for the euclidean. Note that the definition of the Manhattan distance does not include a square root. – Simon Segert May 23 '21 at 19:50
  • (`no square [root] in metric=cosine, which could be the reason` why not have a metric *sum of squares*? No monotone function of a single value changes order.)(errm - *always plural* only applies to poetry? English…) – greybeard May 24 '21 at 04:31

I've run your code snippet on a Mac with sklearn 0.24.1, and got:

metric    cosine  euclidean
size                       
100     0.000322   0.000165
300     0.000205   0.000186
1000    0.000273   0.000271
3000    0.000503   0.000531
10000   0.001459   0.001326
30000   0.002919   0.002784
100000  0.008977   0.008872

So it's probably an implementation issue that got fixed in v0.24.

igrinis

The short answer is that in order to compute a square root, which is present in the Euclidean distance, the computer needs to sum a mathematical series, which takes many operations, whereas the cosine distance can be computed directly, with only 4 operations.

  • `the computer needs to do a mathematical series sum which results in many operations` frankly: [no](https://en.m.wikipedia.org/wiki/Root-finding_algorithms). – greybeard Jun 01 '21 at 15:14
  • This idea has already been proposed in the answer by Arturo Sbr (and refuted in the comments to that answer) – Simon Segert Jun 01 '21 at 15:22