I am trying two methods to compute the squared Euclidean distance between pairs of vectors.

Using NumPy and scikit-learn:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def inference(feature_list):
    distances = np.zeros(len(feature_list))
    for idx, pair in enumerate(feature_list):
        # Euclidean distance between the two vectors of the pair, then squared
        distances[idx] = euclidean_distances(pair[0].reshape((1, -1)), pair[1].reshape((1, -1))).item()
        distances[idx] = distances[idx] * distances[idx]
    return distances

Using plain Python loops:

def inference1(feature_list):
    distances = np.zeros(len(feature_list))
    for idx, pair in enumerate(feature_list):
        # Accumulate the squared differences element by element
        for pair_idx in range(len(pair[0])):
            tmp = pair[0][pair_idx] - pair[1][pair_idx]
            distances[idx] += tmp * tmp

    return distances
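For reference, a minimal vectorized sketch of the same squared-distance computation (the helper name `inference_vectorized` is mine, not part of the original code):

import numpy as np

def inference_vectorized(feature_list):
    # Stack the pairs into two (n, d) arrays and sum the squared differences per row
    a = np.stack([pair[0] for pair in feature_list])
    b = np.stack([pair[1] for pair in feature_list])
    return np.sum((a - b) ** 2, axis=1)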

The code to test the results is:

def main(args):
    d = 128
    n = 100
    array2 = [(np.random.rand(d)/4, np.random.rand(d)/3) for x in range(n)]

    result = sample.inference(array2)
    print(list(result)) # print result 1


    result = sample.inference1(array2)
    print(list(result)) # print result 2

The results differ when n reaches 100000, but they are the same when n is small.

Why would it happen? How can I get the same result?

  • The doc (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html) seems to say that "this is not the most precise way of doing this computation". Not sure if that is the explanation for your observation. – fountainhead Mar 19 '19 at 08:45
  • I am confused about why the number of examples will affect the result of computation. – Tengerye Mar 19 '19 at 08:51
  • After generating your 100000 values, suppose you restrict yourself to only the last 1000 of them, and ignore the rest of the 100000 values, do you still see an anomaly with the 1000 values? – fountainhead Mar 19 '19 at 08:57
  • How do you test the equality between the two results ? I guess you don't just read the 100000 values ? – Robin Mar 19 '19 at 09:09
  • @fountainhead You are correct. It seems it is the `print` of python that goes wrong. Why is that? BTW, I can't accept your answer if it is on the comment. – Tengerye Mar 19 '19 at 09:18
  • 1
    @Tengerye: I am still not sure if the anomaly is in the `print`, or in the **actual values** that appear towards the end last part of your 100000 values. No worries about me not getting credit, because, I'm just throwing guesses at you, and trying to learn something along the way myself. – fountainhead Mar 19 '19 at 09:27

1 Answer

In this minimal example, we see that the difference between the two results is negligible.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def inference_sklearn(feature_list):
    distances = np.zeros(len(feature_list))
    for idx, pair in enumerate(feature_list):
        distances[idx] = euclidean_distances(pair[0].reshape((1, -1)), pair[1].reshape((1, -1))).item()
        distances[idx] = distances[idx] * distances[idx]
    return distances

def inference_python(feature_list):
    distances = np.zeros(len(feature_list))
    for idx, pair in enumerate(feature_list):
        for pair_idx in range(len(pair[0])):
            tmp = pair[0][pair_idx] - pair[1][pair_idx]
            distances[idx] += tmp * tmp

    return distances


d = 128
ns = [100, 1000, 10000, 100000, 200000]
for n in ns:
    print("n =", n)
    test_array = [(np.random.rand(d)/4, np.random.rand(d)/3) for x in range(n)]
    result_sklearn = inference_sklearn(test_array)
    result_python = inference_python(test_array)
    # Distance between the two result vectors: 0.0 means they agree exactly
    print(euclidean_distances([result_sklearn], [result_python])[0][0])

Output:

n = 100
0.0
n = 1000
0.0
n = 10000
0.0
n = 100000
0.0
n = 200000
1.52587890625e-05

Don't just print your results when you want to test equality; compare them numerically. You can also use numpy.set_printoptions to control how your arrays are formatted when they are printed.
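As a minimal sketch of such a numerical comparison (assuming the `result_sklearn` and `result_python` arrays from the snippet above):

import numpy as np

# Element-wise comparison with a tolerance, instead of inspecting printed values
print(np.allclose(result_sklearn, result_python, rtol=1e-9, atol=1e-12))

# Print with more decimal places so small discrepancies become visible
np.set_printoptions(precision=17)
print((result_sklearn - result_python)[:5])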
