
I've been told that double for loops are slow in Python. However, while digging into the implementation of scipy.spatial.distance.cdist(), I found that it's just a double for loop, not some clever NumPy trick. And yet many answers (e.g. https://stackoverflow.com/a/62416370) recommend using cdist().

My question is: why is the function still so fast if it's just naively iterating over all elements in O(n^2)?
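For reference, a minimal example of the kind of call I'm talking about (the array sizes are just illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# 2000 points each in 3-D; cdist computes all 2000 x 2000 pairwise distances
rng = np.random.default_rng(0)
X = rng.random((2000, 3))
Y = rng.random((2000, 3))

# O(n^2) pairs, yet this returns almost instantly
D = cdist(X, Y, metric='euclidean')
print(D.shape)  # (2000, 2000)
```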

Alantin14
    You are right, it is `O(n^2)`; that being said, **it is fast because it uses the numpy/scipy implementations of the metrics** (this is where the bulk of the cost is). So it is preferable, in general, to a custom implementation. – Dani Mesejo Oct 11 '21 at 19:54
    The speed of numpy methods isn't because of some special trick that reduces the `O`. It's because the iterations are done efficiently in compiled code. A competent `C` programmer could write equally fast code. It's not that numpy is so fast, but that python level loops are slow. – hpaulj Oct 11 '21 at 20:05

1 Answer


The function with two for loops that you are pointing to, _cdist_callable(), is not used in most cases. If you read the cdist() function, you can see that _cdist_callable() is only run in special cases, for example when you provide your own metric function.

In the typical scenario, when you provide the metric as a string (euclidean, chebyshev, cityblock, etc.), C-optimized functions are used instead. The "handles" to those C-optimized functions for those metrics are listed in the same file.
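You can see the difference yourself by timing the two call paths. A rough sketch (the lambda metric is just an illustrative stand-in for a custom Python metric; exact timings will vary by machine):

```python
import time
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.random((200, 3))
Y = rng.random((200, 3))

# String metric: dispatched to the compiled C implementation
t0 = time.perf_counter()
D_fast = cdist(X, Y, metric='euclidean')
t_fast = time.perf_counter() - t0

# Callable metric: 200 * 200 = 40,000 Python-level function calls
t0 = time.perf_counter()
D_slow = cdist(X, Y, metric=lambda u, v: np.sqrt(((u - v) ** 2).sum()))
t_slow = time.perf_counter() - t0

# Both paths produce the same distances, but at very different speeds
print(np.allclose(D_fast, D_slow))  # True
print(f"string metric: {t_fast:.5f}s, callable metric: {t_slow:.5f}s")
```

On a typical machine the callable version is orders of magnitude slower, even though both do the same O(n^2) work, which is exactly the point: the loop itself is fine, running it in the Python interpreter is what costs.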

dankal444