Using a custom metric with sklearn.neighbors.BallTree gives wrong input?

Question

I'm trying to use a custom metric with sklearn.neighbors.BallTree, but the inputs passed to my metric do not look correct. With scipy.spatial.distance.pdist the same metric works as expected. When I instantiate a BallTree, however, the metric raises an exception at the reshape, and when I inspect the actual inputs, neither their shape nor their values match my data.

import numpy as np
import scipy.spatial.distance as spdist
import sklearn.neighbors.ball_tree as ball_tree


# custom metric
def minimum_average_direct_flip(x, y):
    x = np.reshape(x, (-1, 3))
    y = np.reshape(y, (-1, 3))
    direct = np.mean(np.sqrt(np.sum(np.square(x - y), axis=1)))
    flipped = np.mean(np.sqrt(np.sum(np.square(np.flipud(x) - y), axis=1)))
    return min(direct, flipped)

# create an X to test
X = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9], [11, 12, 13, 14, 15, 16, 17, 18, 19], [21, 22, 23, 24, 25, 26, 27, 28, 29]])

# works as expected
distances = spdist.pdist(X, metric=minimum_average_direct_flip)

# outputs: [ 17.32050808  34.64101615  17.32050808]
print(distances)

# raises exception, inputs to minimum_average_direct_flip look wrong
# Traceback (most recent call last):
#   File ".../test_script.py", line 23, in <module>
#     ball_tree.BallTree(X, metric=minimum_average_direct_flip)
#   File "sklearn/neighbors/binary_tree.pxi", line 1059, in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn\neighbors\ball_tree.c:8381)
#   File "sklearn/neighbors/dist_metrics.pyx", line 262, in sklearn.neighbors.dist_metrics.DistanceMetric.get_metric (sklearn\neighbors\dist_metrics.c:4032)
#   File "sklearn/neighbors/dist_metrics.pyx", line 1091, in sklearn.neighbors.dist_metrics.PyFuncDistance.__init__ (sklearn\neighbors\dist_metrics.c:10586)
#   File "C:/Users/danrs/Documents/neuro_atlas/test_script.py", line 8, in minimum_average_direct_flip
#     x = np.reshape(x, (-1, 3))
#   File "C:\Anaconda2\lib\site-packages\numpy\core\fromnumeric.py", line 225, in reshape
#     return reshape(newshape, order=order)
# ValueError: total size of new array must be unchanged
ball_tree.BallTree(X, metric=minimum_average_direct_flip)

In the first call to minimum_average_direct_flip from the BallTree code, the inputs are:

x = [ 0.4238394   0.55205233  0.04699435  0.19542642  0.20331665  0.44594837 0.35634537  0.8200018   0.28598294  0.34236847]
y = [ 0.4238394   0.55205233  0.04699435  0.19542642  0.20331665  0.44594837 0.35634537  0.8200018   0.28598294  0.34236847]

These look completely incorrect: each vector has 10 elements while the rows of X have 9, and none of the values come from X. Am I doing something wrong in the way I am calling this, or is this a bug in sklearn?

score 0 · Answer 1 · answered Aug 07 '16 at 16:21

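It seems that this is a known issue: https://github.com/scikit-learn/scikit-learn/issues/6287

The traceback shows that the failing call comes from PyFuncDistance.__init__, not from building the tree itself: sklearn calls the metric once on a test vector as a validation step before it ever touches X. That test vector has 10 elements, which is not a multiple of 3, so the reshape in my metric fails. As a workaround I guess I can add a check on the input size, but as the issue notes this is undesirable because it prevents me from doing real validation of the inputs myself.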
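
A minimal sketch of that size check, assuming the validation vector is the only input whose length is not a multiple of 3 and that it is passed as both arguments (as the x and y in the question suggest). The name guarded_metric and the Euclidean fallback are my own choices for illustration, not something sklearn prescribes.

```python
import numpy as np


def minimum_average_direct_flip(x, y):
    # Original metric from the question: treat each input as a list of 3-D points.
    x = np.reshape(x, (-1, 3))
    y = np.reshape(y, (-1, 3))
    direct = np.mean(np.sqrt(np.sum(np.square(x - y), axis=1)))
    flipped = np.mean(np.sqrt(np.sum(np.square(np.flipud(x) - y), axis=1)))
    return min(direct, flipped)


def guarded_metric(x, y):
    # Hypothetical wrapper: tolerate inputs that cannot be reshaped to (-1, 3).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size % 3 != 0 or y.size % 3 != 0:
        # Assumed to be the validation call only; fall back to plain
        # Euclidean distance so a valid float is still returned.
        return float(np.sqrt(np.sum(np.square(x - y))))
    return float(minimum_average_direct_flip(x, y))


# Real data (9 values per row) still goes through the custom metric.
row_a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
row_b = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19])
print(guarded_metric(row_a, row_b))  # about 17.32, matching the pdist result

# A length-10 vector like the one in the question no longer raises.
probe = np.random.random(10)
print(guarded_metric(probe, probe))
```

Passing guarded_metric instead of minimum_average_direct_flip to BallTree should get past the constructor on the affected versions. The cost is the one the issue describes: a genuinely malformed input of the wrong length is now silently accepted instead of rejected.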
