20

I have an array of 3D points:

a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])

How can I compute the geometric median of those points?

orlp
  • 112,504
  • 36
  • 218
  • 315
  • Does this help: http://stackoverflow.com/questions/21617194/mode-median-mean-of-a-3d-numpy-array? – EdChum May 18 '15 at 09:24
  • @EdChum Don't think so, I've looked at the implementation of `np.median`, and it seems to be implemented in terms of `partition` - that can't be right for geometric median. – orlp May 18 '15 at 09:36

3 Answers3

34

I implemented Yehuda Vardi and Cun-Hui Zhang's algorithm for the geometric median, described in their paper "The multivariate L1-median and associated data depth". Everything is vectorized in numpy, so should be very fast. I didn't implement weights - only unweighted points.

import numpy as np
from scipy.spatial.distance import cdist, euclidean

def geometric_median(X, eps=1e-5):
    y = np.mean(X, 0)

    while True:
        D = cdist(X, [y])
        nonzeros = (D != 0)[:, 0]

        Dinv = 1 / D[nonzeros]
        Dinvs = np.sum(Dinv)
        W = Dinv / Dinvs
        T = np.sum(W * X[nonzeros], 0)

        num_zeros = len(X) - np.sum(nonzeros)
        if num_zeros == 0:
            y1 = T
        elif num_zeros == len(X):
            return y
        else:
            R = (T - y) * Dinvs
            r = np.linalg.norm(R)
            rinv = 0 if r == 0 else num_zeros/r
            y1 = max(0, 1-rinv)*T + min(1, rinv)*y

        if euclidean(y, y1) < eps:
            return y1

        y = y1

In addition to the default SO license terms, I release the code above under the zlib license, if you so prefer.

orlp
  • 112,504
  • 36
  • 218
  • 315
  • I am currently looking for alternatives to using ArcGIS and this gets very close to the median center from using their Median Center Geoprocessing tool (+- 1m over 700+ pts). Any chance you could comment what is happening in the code? I'm not as suave with Python and Numpy yet. I have input a numpy array for X that contains the xy coords of the pts. I would like to know what is happening in the function if you have the tme to explain. Cheers – Clubdebambos Mar 08 '17 at 15:32
  • 1
    @Clubdebambos I'm not certain what you'd even want me to explain. You want to know why it differs by +- 1m? I don't have access to their code, so I have no clue what they're doing. And even if I did there's many reasons in both my and their code (numerical instability, floating point errors, multiple possible median candidates, a bug in the algorithm or implementation), and I can't really be bothered to find out which it is. – orlp Mar 08 '17 at 16:40
  • I am not looking to compare, I want to implement your code in a workflow. At the moment I'd just be using it blindly, which I'm ok with, but I'd rather have some idea what is happening in the geometric_median function. I'm not at a level of coding that you are at so I can't follow it. – Clubdebambos Mar 08 '17 at 16:46
  • Oh like that. Sorry I don't think commenting line by line would be very useful, you can read the paper that I've linked. – orlp Mar 08 '17 at 17:05
  • Cheers, I'll have a read and then get into learning the ins and outs of numpy and scipy. Fantastic function by the way. Perfect for my needs. – Clubdebambos Mar 08 '17 at 17:09
  • @Nabla It's been a while since I wrote the code, but I assume the same format as in the question. – orlp Oct 04 '17 at 21:24
  • I have the same question about the above code. I'm trying to implement this geometric_median in my project. I have a numpy array containing WorkerID, Latitude, and Longitude. I want to get the median Latitude and Longitude per worker. Should X be the tuple Latitude, Longitude? And what would [y] be? – salvationishere Jun 15 '18 at 21:22
  • In some settings the convergence condition may not be met `||y - y1|| < eps`. Hence, I suggest replacing it with `||y - y1|| / ||y|| < eps`, which is more robust to oscillations or adding a maximum amount of iterations to avoid a rare (but possible) infinite loop. – lfelix Apr 08 '20 at 00:58
9

The calculation of the geometric median with the Weiszfeld's iterative algorithm is implemented in Python in this gist or in the function below copied from the OpenAlea software (CeCILL-C license),

import numpy as np
import math
import warnings

def geometric_median(X, numIter = 200):
    """
    Compute the geometric median of a point sample.
    The geometric median coordinates will be expressed in the Spatial Image reference system (not in real world metrics).
    We use the Weiszfeld's algorithm (http://en.wikipedia.org/wiki/Geometric_median)

    :Parameters:
     - `X` (list|np.array) - voxels coordinate (3xN matrix)
     - `numIter` (int) - limit the length of the search for global optimum

    :Return:
     - np.array((x,y,z)): geometric median of the coordinates;
    """
    # -- Initialising 'median' to the centroid
    y = np.mean(X,1)
    # -- If the init point is in the set of points, we shift it:
    while (y[0] in X[0]) and (y[1] in X[1]) and (y[2] in X[2]):
        y+=0.1

    convergence=False # boolean testing the convergence toward a global optimum
    dist=[] # list recording the distance evolution

    # -- Minimizing the sum of the squares of the distances between each points in 'X' and the median.
    i=0
    while ( (not convergence) and (i < numIter) ):
        num_x, num_y, num_z = 0.0, 0.0, 0.0
        denum = 0.0
        m = X.shape[1]
        d = 0
        for j in range(0,m):
            div = math.sqrt( (X[0,j]-y[0])**2 + (X[1,j]-y[1])**2 + (X[2,j]-y[2])**2 )
            num_x += X[0,j] / div
            num_y += X[1,j] / div
            num_z += X[2,j] / div
            denum += 1./div
            d += div**2 # distance (to the median) to miminize
        dist.append(d) # update of the distance evolution

        if denum == 0.:
            warnings.warn( "Couldn't compute a geometric median, please check your data!" )
            return [0,0,0]

        y = [num_x/denum, num_y/denum, num_z/denum] # update to the new value of the median
        if i > 3:
            convergence=(abs(dist[i]-dist[i-2])<0.1) # we test the convergence over three steps for stability
            #~ print abs(dist[i]-dist[i-2]), convergence
        i += 1
    if i == numIter:
        raise ValueError( "The Weiszfeld's algoritm did not converged after"+str(numIter)+"iterations !!!!!!!!!" )
    # -- When convergence or iterations limit is reached we assume that we found the median.

    return np.array(y)

Alternatively, you could use the C implementation, mentionned in this answer, and interface it to python with, for instance, ctypes.

rth
  • 10,680
  • 7
  • 53
  • 77
  • This looks slow because it's pure Python - I'm looking for a fast numpy/scipy solution. – orlp May 18 '15 at 09:39
  • 1
    @orlp, a fast numpy/scipy solution requires that the code can be vectorized. From the first glance it is not obvious if that's possible at all. I guess the question here is: How fast do you have to get. I guess writing a `cython` version of the gist already gives you very good speed. Using a c-implementation as suggested in this answer could be even faster. – cel May 18 '15 at 09:49
  • @rth It actually looks quite vectorizable to me, I'll give it a go. – orlp May 18 '15 at 10:02
  • @orlp Yes, you are right. Using `scipy.spatial.distance.cdist` for the distance calculation, should speed this up. Although, the `while` loop can't be avoided because of the iterative nature of the algorithm. BTW, if you manage to optimize the solution, feel free to contribute it back to the OpenAlea software. – rth May 18 '15 at 10:06
  • @rth Exactly, I'm using `cdist` for my solution. I'm quite new to numpy/scipy however, so it will take me some time to cook it up :) – orlp May 18 '15 at 10:08
  • 1
    @orlp why should pure Python be slow? have you tried to use an [LLVM](https://en.wikipedia.org/wiki/LLVM) compiler such as [Numba](http://numba.pydata.org/)? – Aprillion May 18 '15 at 10:22
  • @Aprillion Python is incredibly slow compared to the vectorization you would get with numpy. – orlp May 18 '15 at 10:23
  • @cel See my answer for a vectorized numpy implementation. – orlp May 18 '15 at 13:56
  • it looks like this one returns the wrong answer for [[1, 1, 0], [1, -1, 0], [-1, -1, 0], [-1, 1, 0]] – shoosh Jun 12 '17 at 12:04
3

The problem can be easily approximated with minimize module in scipy. In this module, it provides various optimization algorithms, from nelder-mead to newton-CG. Nelder-mead algorithm is particular useful if you do not want to bother with high order derivatives, at a cost of losing some precision. Nevertheless, you just need to know the function to be minimized for nelder-mead algorithm to work.

Now, referring to the same array in the questions, if we use @orlp's method, we will get this:

geometric_median(a)
# array([12.58942481,  3.51573852,  7.28710661])

For Nelder-mead method, you will see below. The function to be minimized is the distance function from all points i.e.

a formula

Here is the code:

from scipy.optimize import minimize
x = [point[0] for point in a]
y = [point[1] for point in a]
z = [point[2] for point in a]

x0 = np.array([sum(x)/len(x),sum(y)/len(y), sum(z)/len(z)])
def dist_func(x0):
    return sum(((np.full(len(x),x0[0])-x)**2+(np.full(len(x),x0[1])-y)**2+(np.full(len(x),x0[2])-z)**2)**(1/2))
res = minimize(dist_func, x0, method='nelder-mead', options={'xtol': 1e-8, 'disp': True})
res.x
# array([12.58942487,  3.51573846,  7.28710679])

Note that I use to mean of all points as the initial values for the alogrithm. The result is quite close to @orlp's method, which is more accurate. As I mentioned, you sacrifice a bit but still get quite good approximates.

Performance of Nelder Mead Algorithm For this, I generated a test_array with 10000 entries of points from normal distribution centred at 3.2. Therefore, the geometric median should be quite close to [3.2, 3.2, 3.2].

np.random.seed(3)
test_array = np.array([[np.random.normal(3.2,20),
                        np.random.normal(3.2,20),
                        np.random.normal(3.2,20)] for i in np.arange(10000)])

For @orlp's method,

%timeit geometric_median(test_array)
# 12.1 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# array([2.95151061, 3.14098477, 3.01468281])

For Nelder mead,

%timeit res.x
# 565 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# array([2.95150898, 3.14098468, 3.01468276])

@orlp's method is very fast while Nelder mead is not bad. However, Nelder mead method is generic whereas @orlp's is specific to geometric median. The method you would like to choose would depend on your purpose. If you just want an approximate, I will choose Nelder at ease. If you want to be exact, @orlp's method is both faster and more accurate.

shiba
  • 61
  • 4