
I am trying to solve the following one-dimensional least absolute deviations (LAD) optimization problem:

$$ \min_{\beta} \; \sum_{i=1}^{n} \left| x_i \beta - y_i \right| $$

and I am using the bisection method to find the best beta (a scalar). I have the following code:

import numpy as np
x = np.random.randn(10)
y = np.sign(np.random.randn(10))  + np.random.randn(10)

def find_best(x, y):
    best_obj = 1000000
    for beta in np.linspace(-100,100,1000000):
        if np.abs(beta*x - y).sum() < best_obj:
            best_obj = np.abs(beta*x - y).sum()
            best_beta = beta
    return best_obj, best_beta

def solve_bisect(x, y):
    it = 0
    u = 100
    l = -100
    while True:
        it +=1

        if it > 40:
            # maxIter reached
            return obj, beta, subgrad

        # select the mid-point
        beta = (l + u)/2

        # subgradient calculation: \partial |x*beta - y| = sign(x*beta - y) * x.
        # The mask np.abs(x * beta - y) > 1e-5 is there to avoid numerical issues near the kinks.
        subgrad = (np.sign(x * beta - y) * (np.abs(x * beta - y) > 1e-5) * x).sum()
        obj = np.sum(np.abs(x * beta - y))
        print('obj = %f, subgrad = %f, current beta = %f' % (obj, subgrad, beta))

        # bisect. check subgrad to decide which part of the space to cut out
        if np.abs(subgrad) <1e-3:
            return obj, beta, subgrad
        elif subgrad > 0:
            u = beta + 1e-3
        else:
            l = beta - 1e-3

brute_sol = find_best(x, y)
bisect_sol = solve_bisect(x, y)
print('brute_sol: obj = %f, beta = %f' % (brute_sol[0], brute_sol[1]))
print('bisect_sol: obj = %f, beta = %f, subgrad = %f' % (bisect_sol[0], bisect_sol[1], bisect_sol[2]))

As you can see, I also have a brute-force implementation that searches over the space to get the oracle answer (up to some numerical error). Every run finds the optimal beta and the minimum objective value. However, the subgrad is not 0 (not even close). For instance, in one of my runs I got the following:

brute_sol: obj = 10.974381, beta = -0.440700
bisect_sol: obj = 10.974374, beta = -0.440709, subgrad = 0.840753

The objective value and the best beta are correct, but the subgrad is not close to 0 at all. So the questions:

  1. Why is the subgrad not close to 0? Isn't the optimality condition that 0 is in the subdifferential if and only if the point is optimal?
  2. What stopping criterion should we use instead?

3 Answers


I'm not familiar with the term subgradient, but to understand why the subgrad you calculate will often not be 0, let us look at the following simple example: x1 = 1000, x2 = 1, y1 = 0, y2 = 1.

The minimum is obviously 1 and is attained at beta=0, but subgrad would equal -1. Observe, however, that at beta=0+eps the gradient is 999 and at beta=0-eps the gradient is -1001, which suggests the correct criterion:

$$ \lim_{\beta \to \beta_0^-} \nabla_{\beta} f < 0 \quad \text{and} \quad \lim_{\beta \to \beta_0^+} \nabla_{\beta} f > 0 $$
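
In code, this criterion can be checked directly, because the one-sided derivatives of f(beta) = sum_i |x_i*beta - y_i| have a closed form: a residual that is exactly zero contributes +|x_i| to the right derivative and -|x_i| to the left derivative, while every other term contributes sign(x_i*beta - y_i)*x_i to both. Here is a minimal sketch (the function names and tolerance are mine, not from the question's code) of a stopping test that could replace the |subgrad| < 1e-3 check:

import numpy as np

def one_sided_derivatives(x, y, beta, eps=1e-8):
    """Left and right derivatives of f(beta) = sum(|x*beta - y|) at beta."""
    r = x * beta - y
    zero = np.abs(r) <= eps                      # residuals sitting exactly on a kink
    base = np.where(zero, 0.0, np.sign(r) * x)   # smooth terms: sign(r_i) * x_i
    d_right = np.sum(base + zero * np.abs(x))    # kink terms add +|x_i| from the right
    d_left = np.sum(base - zero * np.abs(x))     # ... and -|x_i| from the left
    return d_left, d_right

def is_minimizer(x, y, beta):
    """Optimal iff the objective does not decrease in either direction."""
    d_left, d_right = one_sided_derivatives(x, y, beta)
    return d_left <= 0.0 <= d_right

On the example above (x = [1000, 1], y = [0, 1], beta = 0) this gives d_left = -1001 and d_right = 999, so beta = 0 is accepted as the minimizer even though the single subgrad value computed in the question is -1.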

Paul Panzer

I'm not an expert in subderivatives, but my understanding is that at a nondifferentiable point a function will generally have many subderivatives. For example, for the absolute value function at 0, every line y = m*x with |m| <= 1 is a subtangent. Obviously the minimum is at 0, but 1 is certainly a valid subgradient.
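
Concretely (my notation, just spelling this out), the subdifferential of the absolute value at zero is the whole interval [-1, 1], and each zero residual in the LAD objective contributes such an interval scaled by |x_i|:

$$ \partial \left| x \right| \Big|_{x = 0} = \left[ -1, 1 \right], \qquad \partial f \left( \beta \right) = \sum_{i : x_i \beta \neq y_i} \operatorname{sign} \left( x_i \beta - y_i \right) x_i + \sum_{i : x_i \beta = y_i} \left[ - \left| x_i \right|, \left| x_i \right| \right] $$

Optimality only requires that 0 lies somewhere in this set; the particular subgradient computed in the question (which zeroes out the kink terms via the mask) is just one element of it and need not be zero at the minimizer.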

As to your problem, I think there are some faster ways to solve it. For one thing, you know a minimizer must occur at one of the knot points (i.e. the points where the function is nondifferentiable). These nondifferentiable points occur at the n points beta = y_i / x_i. The first approach is simply to compute the objective at each of the n candidates and take the smallest, which is O(n^2). The second is to sort the candidate list (O(n log n)) and then bisect over it (O(log n) iterations). The code below shows the brute force, the try-all-knot-points approach, and the bisection method (there may be some corner cases I haven't thought all the way through). I also plotted an example of the objective so you can verify that a subgradient at the minimizer need not be zero.

import numpy as np
import matplotlib.pyplot as plt
import pdb

def brute_force(x,y):
    optimum = np.inf
    beta_optimal = 0

    for b in np.linspace(-5,5,1000):
        obj = np.sum(np.abs(b * x - y))
        if obj < optimum:
            beta_optimal = b
            optimum = obj
    return beta_optimal, optimum

def faster_solve(x, y):
    soln_candidates = y / (x + 1e-8) # hack for div by zero
    optimum = np.inf
    beta_optimal = 0
    for b in soln_candidates.squeeze():
        obj = np.sum(np.abs(b * x - y))
        if obj < optimum:
            beta_optimal = b
            optimum = obj

    return beta_optimal, optimum

def bisect_solve(x,y):
    soln_candidates = (y / (x + 1e-8)).squeeze() # avoid div by zero
    sorted_solns = np.sort(soln_candidates)
    indx_l = 0
    indx_u = x.shape[0] - 1

    while True:
        if (indx_l + 1 >= indx_u):
            beta_l = sorted_solns[indx_l]
            beta_u = sorted_solns[indx_u]

            obj_l = np.sum(np.abs(beta_l * x - y))
            obj_u = np.sum(np.abs(beta_u * x - y))

            if obj_l < obj_u:
                return beta_l, obj_l
            else:
                return beta_u, obj_u

        mid = int((indx_l + indx_u)/2)
        beta_mid = sorted_solns[mid]
        diff_mid = beta_mid * x - y
        subgrad_mid = np.sum(np.sign(diff_mid) * x)
        if subgrad_mid > 0:
            indx_u = mid
        elif subgrad_mid < 0:
            indx_l = mid

def main():
  np.random.seed(70963)
  N = 10
  x = np.random.randint(-9,9, (N,1))
  y = np.random.randint(-9,9, (N,1))

  num_plot_pts = 1000
  beta = np.linspace(-5,5, num_plot_pts)
  beta = np.expand_dims(beta, axis=1)
  abs_diff_mat = np.abs(np.dot(x, beta.T) - y)  # broadcasting!!
  abs_dev = np.sum(abs_diff_mat, axis=0)  # sum the rows together

  brute_optimal_beta, brute_optimum = brute_force(x,y)
  fast_beta, fast_optimum = faster_solve(x,y)
  bisect_beta, bisect_optimum = bisect_solve(x,y)

  print('Brute force beta: {0:.4f} with objective value {1:.4f}'.format(brute_optimal_beta, brute_optimum))
  print('Faster solve beta: {0:.4f} with objective value {1:.4f}'.format(fast_beta, fast_optimum))
  print('Bisection solve beta: {0:.4f} with objective value {1:.4f}'.format(bisect_beta, bisect_optimum))

  plt.plot(beta, abs_dev)
  plt.grid()
  plt.xlabel(r'$\beta$')
  plt.ylabel(r'$\sum_{i=1}^N |\beta x_i - y_i|$')
  plt.title(r'Absolute Deviation as function of $\beta$')
  plt.tight_layout()
  plt.show()

if __name__ == '__main__':
  main()

[Plot: Absolute Deviation as a function of β]

Casey

Write your problem in matrix form:

$$ f \left( \beta \right) = {\left\| X \beta - y \right\|}_{1} $$


Then:

$$ \frac{\partial f \left( \beta \right) }{\partial \beta} = {X}^{T} \operatorname{sign} \left( X \beta - y \right ) $$


This is indeed a subgradient of the function.
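
A NumPy sketch of that expression (the function name is mine; in the question's one-dimensional case X is simply the column vector x and beta has length 1):

import numpy as np

def lad_subgradient(X, beta, y):
    """One subgradient of f(beta) = ||X beta - y||_1.

    np.sign returns 0 for zero residuals, i.e. it picks the subgradient 0
    out of the interval available at each kink.
    """
    return X.T @ np.sign(X @ beta - y)

# 1-D example in the question's setting (shapes: X is (n, 1), beta is (1,), y is (n,)):
# g = lad_subgradient(x.reshape(-1, 1), np.array([0.5]), y)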

Royi