
I'm using a linear program to test whether two sets of points are separable by a line. In theory this works, but in practice there seem to be numerical issues when the points are close to collinear or close to each other.

The package I'm using is scipy's linprog (scipy.optimize.linprog).

I wrote some code to illustrate the problem. It generates a cloud 'locations' of N points, picks some random lines, uses a margin around each line to divide a subset of 'locations' into two regions X and Y, and then checks whether the linear program finds a separating line between X and Y, or instead concludes that no such line exists (fails to find a feasible point).

As you can see from the output below, as the margin shrinks from 1 to 2^(-9), the probability that the linear program correctly detects that the two regions are linearly separable decays. This suggests there may be issues detecting that two very nearby point clouds can be separated.

Note that because I'm feeding the linear program sets which are guaranteed to be linearly separable, the outputs should all be 1.

import numpy as np
from scipy.optimize import linprog

N = 100

for minv in range(10):
    margin = 2.0 ** -minv  # float exponent avoids integer division under Python 2
    locations = np.random.uniform(-1,1,[N,2])
    tests = []
    for i in range(50):
        # Random line: the points a with np.dot(a, p[:2]) == p[2]
        p = np.random.normal(0,1,[3])
        X = []
        Y = []
        for a in locations:
            if np.dot(a, [p[0],p[1]]) < p[2] - margin:
                X.append(a)
            if np.dot(a, [p[0],p[1]]) > p[2] + margin:
                Y.append(a)
        #X and Y are two linearly separable subsets of 'locations'.
        #Feasibility LP in variables (w0, w1, t): find a line w.s = t
        #with w.s >= t + 1 on X and w.s <= t - 1 on Y.
        A = []
        b = []
        if X and Y:
            #This guard avoids an edge case (an empty side) that causes problems with linprog
            for s in X:
                A.append( [-1*v for v in s] + [1] )  # -w.s + t <= -1
                b.append(-1)
            for s in Y:
                A.append( list(s) + [-1])            #  w.s - t <= -1
                b.append(-1)
            #See: https://www.joyofdata.de/blog/testing-linear-separability-linear-programming-r-glpk/
            res = linprog(c = [0,0,0], A_ub = A, b_ub = b, bounds = [(None, None)]*3)
            tests.append(res["success"])
    print(np.mean(tests))

Output:

1.0
1.0
0.909090909091
0.8
0.805555555556
0.5
0.375
0.444444444444
0.378378378378
0.410256410256

My questions:

  1. How can I reliably (and efficiently) detect whether two sets of points are linearly separable? (Another approach would be to build convex hulls, and I've mostly written this out; there are just some issues with using Qhull: Qhull Convex hull wants me to input at least 3 points. A sketch of this approach follows after question 2.)

  2. Is this a bug in scipy linprog? Or am I just not using it correctly?
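
For reference, here is a minimal sketch of the convex-hull approach mentioned in question 1 (the name separable_by_hulls is illustrative, not from any library). It assumes X and Y are NumPy arrays of shape (n, 2) with at least 3 non-collinear points each, since Qhull rejects degenerate inputs; smaller or flat sets would need special-casing. It relies on the separating axis theorem: two convex polygons are disjoint iff, for some edge normal of one of them, the projections of the two polygons onto that normal do not overlap.

import numpy as np
from scipy.spatial import ConvexHull

def separable_by_hulls(X, Y):
    # Hull vertices in traversal order (Qhull needs >= 3 non-collinear points)
    for poly in (X[ConvexHull(X).vertices], Y[ConvexHull(Y).vertices]):
        edges = np.roll(poly, -1, axis=0) - poly
        normals = np.c_[-edges[:, 1], edges[:, 0]]  # perpendiculars to the edges
        for n in normals:
            px, py = X.dot(n), Y.dot(n)
            # Disjoint projection intervals => this normal gives a separating line
            if px.max() < py.min() or py.max() < px.min():
                return True
    return False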

Areawoman
  • So it seems the Monte Carlo nature might play a role in your interpretation, and that part was not analyzed (how probable is that observation?). It sounds like a really broad question. But if you ask: can I trust scipy.linprog? No, you can't. It's (imho) only useful for toy problems (dense tableau; no presolve and co.; the interior-point method available in scipy 1.0 is quite good, but despite a self-dual design, its infeasibility detection is usually worse than Simplex). Use a real solver: one of the commercials, or Clp, Google's ortools, Glpk... Maybe even SCIP's Soplex rational simplex! (A sketch using ortools follows these comments.) – sascha May 06 '18 at 00:06
  • @sascha since I fed sets which were constructed to be linearly separable, the feasible sets were never empty, so the outputs should all have been 1. Thanks for the comments – I'll try a different solver, I guess. – Areawoman May 06 '18 at 00:44
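
To make the "try a real solver" suggestion concrete, here is a hedged sketch of the same feasibility LP using Google's ortools (GLOP is its LP solver, one of the options named in sascha's comment above). The formulation mirrors the question's: find (w0, w1, t) with w.x >= t + 1 on X and w.y <= t - 1 on Y; separable_ortools is an illustrative name, not an existing API.

from ortools.linear_solver import pywraplp

def separable_ortools(X, Y):
    solver = pywraplp.Solver.CreateSolver('GLOP')
    inf = solver.infinity()
    w0 = solver.NumVar(-inf, inf, 'w0')
    w1 = solver.NumVar(-inf, inf, 'w1')
    t = solver.NumVar(-inf, inf, 't')
    for x in X:
        solver.Add(float(x[0]) * w0 + float(x[1]) * w1 >= t + 1)  # X side
    for y in Y:
        solver.Add(float(y[0]) * w0 + float(y[1]) * w1 <= t - 1)  # Y side
    # Pure feasibility problem: anything other than OPTIMAL/FEASIBLE
    # means no separating line was found
    status = solver.Solve()
    return status in (pywraplp.Solver.OPTIMAL, pywraplp.Solver.FEASIBLE)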

1 Answer


Here is a cleaned-up version that appears to work. I removed the threshold variable, since up to sign it can be absorbed into the normal vector: any separating line w.z = t with t != 0 can be rescaled so that t = ±1. The downside is that we have to check the two sign cases. My gut feeling is that the jump at zero in the threshold variable was throwing the solver off.

import numpy as np
from scipy.optimize import linprog

N = 100

def test(X, Y):
    # With the threshold absorbed into the normal vector w, try the
    # separating line w.z = 1 (res1) and, failing that, w.z = -1 (res2)
    if not (X.size and Y.size):
        return True, np.full((2,), np.nan)
    A = np.r_[-X, Y]
    b = np.repeat((-1, 1), (X.shape[0], Y.shape[0]))
    # Shifting b by 1e-6 makes the inequalities strict
    res1 = linprog(c=[0,0], A_ub=A, b_ub=b-1e-6, bounds=2*[(None, None)])
    res2 = linprog(c=[0,0], A_ub=A, b_ub=-b-1e-6, bounds=2*[(None, None)])
    if res1['success']:
        return True, res1['x']
    elif res2['success']:
        return True, res2['x']
    else:
        return False, np.full((2,), np.nan)

def split(locations, p, margin):
    # The same split as in the question, vectorized: einsum computes locations @ p[:2]
    proj = np.einsum('ij,j', locations, p[:2])
    X = locations[proj < p[2] - margin]
    Y = locations[proj > p[2] + margin]
    return X, Y

def validate(locations, p, p_recon, margin):
    # Independent check that the recovered normal p_recon actually
    # separates the two sets at threshold +1 or -1
    proj = np.einsum('ij,j', locations, p[:2])
    X = locations[proj < p[2] - margin]
    Y = locations[proj > p[2] + margin]
    if not (X.size and Y.size):
        return True
    return ((np.all(np.einsum('ij,j', X, p_recon) > 1)
             and np.all(np.einsum('ij,j', Y, p_recon) < 1)) 
            or
            (np.all(np.einsum('ij,j', X, p_recon) > -1)
             and np.all(np.einsum('ij,j', Y, p_recon) < -1)))

def random_split(locations):
    # Control experiment: a uniformly random split should almost never be separable
    inX = np.random.randint(0, 2, (locations.shape[0],), dtype=bool)
    return locations[inX], locations[~inX]

for minv in range(10):
    margin = 2.0 ** -minv  # float exponent avoids integer division under Python 2
    locations = np.random.uniform(-1,1,[N,2])
    P = np.random.normal(0, 1, (50, 3))
    tests, P_recon = zip(*(test(*split(locations, p, margin)) for p in P))
    assert all(validate(locations, *ps, margin) for ps in zip(P, P_recon))
    control = [test(*random_split(locations))[0] for p in P]
    print('test:', np.mean(tests), 'control:', np.mean(control))

Sample output:

test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
test: 1.0 control: 0.0
Paul Panzer
  • I have no idea why this adjustment makes it work (much) better, but it does. Thank you! Unfortunately, if you put in N = 1000 the performance again declines (for small margins). This is frustrating because this toy example seems like the kind of problem that this algorithm should be able to solve exactly. I suppose I'll have to try another solver after all. (BTW – from your point of view, what is the advantage gained by using einsums and zips, etc.?) – Areawoman May 07 '18 at 16:30
  • You can get it to work with `N=1000` by reducing the tolerance: adding `options={'tol': 1e-10}` to both calls to `linprog`. Re `einsum` etc.: avoiding explicit loops gives you potentially huge speed gains (x100 and more in some cases) and is arguably more readable (once you get used to it, that is). – Paul Panzer May 07 '18 at 16:58
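
For concreteness, the adjustment from the comment above amounts to passing the tol option to both linprog calls in test, e.g.:

res1 = linprog(c=[0,0], A_ub=A, b_ub=b-1e-6, bounds=2*[(None, None)],
               options={'tol': 1e-10})
res2 = linprog(c=[0,0], A_ub=A, b_ub=-b-1e-6, bounds=2*[(None, None)],
               options={'tol': 1e-10})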