I'm learning about support vector machines and trying to write a simple Python implementation (I'm aware of the sklearn package; this is just to understand the concepts better) that does simple linear classification. This is the main material I'm referencing.
I'm trying to solve the SVM in the primal form, by minimizing this:

$$J = \frac{C}{m}\sum_{i=1}^{m}\max\left(0,\ 1-y_i\,\mathbf{w}^\top\mathbf{x}_i\right)+\frac{1}{2}\lVert\mathbf{w}\rVert^2$$
The derivative of J w.r.t. w is (according to the reference above):

$$\frac{\partial J}{\partial\mathbf{w}} = \frac{C}{m}\sum_{i:\ y_i\mathbf{w}^\top\mathbf{x}_i<1}(-y_i\,\mathbf{x}_i)+\mathbf{w}$$
So this is using the "hinge" loss, and C is the penalty parameter. If I understand correctly, setting a larger C forces the SVM toward a harder margin.
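For example, the hinge term only penalizes samples that are inside the margin or misclassified (a minimal illustration with made-up numbers):

    import numpy

    w=numpy.array([0.5,0.5])          # made-up weight vector
    x=numpy.array([[2.,1.],           # comfortably outside the margin
                   [1.,1.],           # exactly on the margin
                   [0.5,0.2]])        # inside the margin
    y=numpy.array([1.,1.,1.])

    margins=y*x.dot(w)                # y_i * w.x_i -> [1.5, 1., 0.35]
    print(numpy.maximum(0.,1.-margins))  # [0.  0.  0.65]: only the last point is penalized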
Below is my code:
import numpy
from scipy import optimize

class SVM2C(object):
    def __init__(self,xdata,ydata,c=200.,learning_rate=0.01,
            n_iter=5000,method='GD'):

        self.values=numpy.unique(ydata)
        self.xdata=xdata
        self.ydata=numpy.where(ydata==self.values[-1],1,-1)
        self.c=c
        self.lr=learning_rate
        self.n_iter=n_iter
        self.method=method
        self.m=len(xdata)
        self.theta=numpy.random.random(xdata.shape[1])-0.5

    def costFunc(self,theta,x,y):
        zs=numpy.dot(x,theta)
        j=numpy.maximum(0.,1.-y*zs).mean()*self.c+0.5*numpy.sum(theta**2)
        return j

    def jac(self,theta,x,y):
        '''Derivative of cost function'''
        zs=numpy.dot(x,theta)
        ee=numpy.where(y*zs>=1.,0.,-y)[:,None]
        # multiply rows by ee
        dj=(ee*x).mean(axis=0)*self.c+theta
        return dj

    def train(self):

        #----------Optimize using scipy.optimize----------
        if self.method=='optimize':
            opt=optimize.minimize(self.costFunc,self.theta,args=(self.xdata,self.ydata),\
                    jac=self.jac,method='BFGS')
            self.theta=opt.x

        #---------Optimize using Gradient descent---------
        elif self.method=='GD':

            costs=[]
            lr=self.lr

            for ii in range(self.n_iter):
                dj=self.jac(self.theta,self.xdata,self.ydata)
                self.theta=self.theta-lr*dj
                cii=self.costFunc(self.theta,self.xdata,self.ydata)
                costs.append(cii)

            self.costs=numpy.array(costs)

        return self

    def predict(self,xdata):

        yhats=[]
        for ii in range(len(xdata)):
            xii=xdata[ii]
            yhatii=xii.dot(self.theta)
            yhatii=1 if yhatii>=0 else 0
            yhats.append(yhatii)
        yhats=numpy.array(yhats)

        return yhats
#-------------Main---------------------------------
if __name__=='__main__':

    xdata = numpy.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
    ydata = numpy.array([1, 1, 2, 2])

    mysvm=SVM2C(xdata,ydata,method='GD')
    mysvm.train()

    from sklearn import svm
    clf=svm.SVC(C=50,kernel='linear')
    clf.fit(xdata,ydata)

    print(mysvm.theta)
    print(clf.coef_)

    #-------------------Plot------------------------
    import matplotlib.pyplot as plt
    figure=plt.figure(figsize=(12,10),dpi=100)
    ax=figure.add_subplot(111)
    cmap=plt.cm.jet
    nclasses=numpy.unique(ydata).tolist()
    colors=[cmap(float(ii)/len(nclasses)) for ii in nclasses]

    #----------------Plot training data----------------
    for ii in range(len(ydata)):
        xii=xdata[ii][0]
        yii=xdata[ii][1]
        colorii=colors[nclasses.index(ydata[ii])]
        ax.plot(xii,yii,color=colorii,marker='o')

    plt.show(block=False)
So it is really a toy example with only 4 linearly separable training samples, and I've dropped the bias term b. The expected result for w is [0.5, 0.5] (the sklearn result), while my implementation tends to give something larger than 0.5 (e.g. [1.4650, 1.4650]), whether using gradient descent or scipy.optimize. This only happens when the C parameter is set to > 1; when C == 1 it gives me [0.5, 0.5]. Also, when C > 1 scipy.optimize fails (I've tried a few different methods, e.g. Newton-CG and BFGS), although its final result is close to the gradient descent result.
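Evaluating the cost by hand at both solutions makes this even more confusing, because [0.5, 0.5] actually has the lower cost (a quick sketch duplicating the objective above; [1.465, 1.465] is a typical output from one of my runs):

    import numpy

    def costFunc(theta,x,y,c=200.):
        # same objective as SVM2C.costFunc above
        zs=numpy.dot(x,theta)
        return numpy.maximum(0.,1.-y*zs).mean()*c+0.5*numpy.sum(theta**2)

    xdata=numpy.array([[-1,-1],[-2,-1],[1,1],[2,1]])
    ydata=numpy.array([-1,-1,1,1])   # labels already mapped to -1/+1

    print(costFunc(numpy.array([0.5,0.5]),xdata,ydata))       # 0.25
    print(costFunc(numpy.array([1.465,1.465]),xdata,ydata))   # ~2.146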
I'm a bit confused about why the w vector stops shrinking. I thought that once all the data are correctly classified with margin, the slack penalties stop contributing to the total cost, so the optimizer would keep decreasing J by shrinking the size of w. Did I get the derivatives wrong?
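One way to sanity-check this is to compare jac against a finite-difference approximation using scipy.optimize.check_grad (a minimal sketch; note the hinge is non-differentiable exactly at y*w.x == 1, so the comparison is only meaningful away from those kinks):

    import numpy
    from scipy import optimize

    # Assumes the SVM2C class defined above is in scope.
    xdata=numpy.array([[-1,-1],[-2,-1],[1,1],[2,1]])
    ydata=numpy.array([1,1,2,2])
    mysvm=SVM2C(xdata,ydata)

    # check_grad returns the norm of the difference between the analytic
    # gradient and a finite-difference approximation of costFunc.
    theta0=numpy.random.random(2)-0.5
    err=optimize.check_grad(mysvm.costFunc,mysvm.jac,theta0,
            mysvm.xdata,mysvm.ydata)
    print('gradient check error:',err)   # should be small if jac matches costFunc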
I know this might be a newbie question and I'm pasting some dirty code; this has been puzzling me for a few days and I have no one around me who can offer help, so any support will be much appreciated!
UPDATE:
Thanks for all the help. I'm updating the code to deal with a slightly more complicated sample. This time I included the bias term, and used the following to update it:

$$\frac{\partial J}{\partial b} = \frac{C}{m}\sum_{i:\ y_i(\mathbf{w}^\top\mathbf{x}_i+b)<1}(-y_i)$$

Following the feedback I got, I tried Nelder-Mead for scipy.optimize, and tried 2 adaptive gradient-descent approaches. Code below:
import numpy
from scipy import optimize

class SVM2C(object):
    def __init__(self,xdata,ydata,c=9000,learning_rate=0.001,
            n_iter=600,method='GD'):

        self.values=numpy.unique(ydata)
        # Add 1 dimension for bias
        self.xdata=numpy.hstack([xdata,numpy.ones([xdata.shape[0],1])])
        self.ydata=numpy.where(ydata==self.values[-1],1,-1)
        self.c=c
        self.lr=learning_rate
        self.n_iter=n_iter
        self.method=method
        self.m=len(xdata)
        self.theta=numpy.random.random(self.xdata.shape[1])-0.5

    def costFunc(self,theta,x,y):
        zs=numpy.dot(x,theta)
        j=numpy.maximum(0.,1.-y*zs).mean()*self.c+0.5*numpy.sum(theta[:-1]**2)
        return j

    def jac(self,theta,x,y):
        '''Derivative of cost function'''
        zs=numpy.dot(x,theta)
        ee=numpy.where(y*zs>=1.,0.,-y)[:,None]
        dj=numpy.zeros(self.theta.shape)
        dj[:-1]=(ee*x[:,:-1]).mean(axis=0)*self.c+theta[:-1]  # weights
        dj[-1]=(ee*self.c).mean(axis=0)                       # bias
        return dj

    def train(self):

        #----------Optimize using scipy.optimize----------
        if self.method=='optimize':
            # NOTE: Nelder-Mead is derivative-free, so jac is ignored here.
            opt=optimize.minimize(self.costFunc,self.theta,args=(self.xdata,self.ydata),\
                    jac=self.jac,method='Nelder-Mead')
            self.theta=opt.x

        #---------Optimize using Gradient descent---------
        elif self.method=='GD':

            costs=[]
            lr=self.lr
            # ADAM parameters
            beta1=0.9
            beta2=0.999
            epsilon=1e-8
            mt_1=0
            vt_1=0

            for ii in range(self.n_iter):
                t=ii+1
                dj=self.jac(self.theta,self.xdata,self.ydata)
                '''
                mt=beta1*mt_1+(1-beta1)*dj
                vt=beta2*vt_1+(1-beta2)*dj**2
                mt=mt/(1-beta1**t)
                vt=vt/(1-beta2**t)
                self.theta=self.theta-lr*mt/(numpy.sqrt(vt)+epsilon)
                mt_1=mt
                vt_1=vt
                '''
                old_theta=self.theta
                self.theta=self.theta-lr*dj
                cii=self.costFunc(self.theta,self.xdata,self.ydata)
                # shrink the step and revert if the cost went up
                if ii>0 and cii>costs[-1]:
                    lr=lr*0.9
                    self.theta=old_theta
                costs.append(cii)

            self.costs=numpy.array(costs)

        self.b=self.theta[-1]
        self.theta=self.theta[:-1]

        return self

    def predict(self,xdata):

        yhats=[]
        for ii in range(len(xdata)):
            xii=xdata[ii]
            yhatii=xii.dot(self.theta)+self.b
            yhatii=self.values[-1] if yhatii>=0 else self.values[0]
            yhats.append(yhatii)
        yhats=numpy.array(yhats)

        return yhats
#-------------Main---------------------------------
if __name__=='__main__':

    #------------------Sample case 1------------------
    #xdata = numpy.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
    #ydata = numpy.array([1, 1, 2, 2])

    #------------------Sample case 2------------------
    from sklearn import datasets
    iris=datasets.load_iris()
    xdata=iris.data[20:,:2]
    ydata=numpy.where(iris.target[20:]>0,1,0)

    #----------------------Train----------------------
    mysvm=SVM2C(xdata,ydata,method='GD')
    mysvm.train()

    ntest=20
    xtest=2*(numpy.random.random([ntest,2])-0.5)+xdata.mean(axis=0)

    from sklearn import svm
    clf=svm.SVC(C=50,kernel='linear')
    clf.fit(xdata,ydata)

    yhats=mysvm.predict(xtest)
    yhats2=clf.predict(xtest)

    print('mysvm weights:', mysvm.theta, 'intercept:', mysvm.b)
    print('sklearn weights:', clf.coef_, 'intercept:', clf.intercept_)
    print('mysvm predict:', yhats)
    print('sklearn predict:', yhats2)

    #-------------------Plot------------------------
    import matplotlib.pyplot as plt
    figure=plt.figure(figsize=(12,10),dpi=100)
    ax=figure.add_subplot(111)
    cmap=plt.cm.jet
    nclasses=numpy.unique(ydata).tolist()
    colors=[cmap(float(ii)/len(nclasses)) for ii in nclasses]

    #----------------Plot training data----------------
    for ii in range(len(ydata)):
        xii=xdata[ii][0]
        yii=xdata[ii][1]
        colorii=colors[nclasses.index(ydata[ii])]
        ax.plot(xii,yii,color=colorii,marker='o',markersize=15)

    #------------------Plot test data------------------
    for ii in range(ntest):
        colorii=colors[nclasses.index(yhats2[ii])]
        ax.plot(xtest[ii][0],xtest[ii][1],color=colorii,marker='^',markersize=5)

    #--------------------Plot line--------------------
    x1=xdata[:,0].min()
    x2=xdata[:,0].max()

    y1=(-clf.intercept_-clf.coef_[0][0]*x1)/clf.coef_[0][1]
    y2=(-clf.intercept_-clf.coef_[0][0]*x2)/clf.coef_[0][1]

    y3=(-mysvm.b-mysvm.theta[0]*x1)/mysvm.theta[1]
    y4=(-mysvm.b-mysvm.theta[0]*x2)/mysvm.theta[1]

    ax.plot([x1,x2],[y1,y2],'-k',label='sklearn line')
    ax.plot([x1,x2],[y3,y4],':k',label='mysvm line')
    ax.legend(loc=0)

    plt.show(block=False)
The new problems I got:

- It is unstable: depending on the initial random parameters, the results can be quite different, and about half the time it mis-classifies 1 sample in the training set even if I've set C to quite a large value. This happens with both scipy.optimize and GD.
- The ADAM approach tends to give infs for vt, as for large C, vt grows very fast. Am I getting the gradients wrong? (A sketch of the standard Adam update is included after this list for reference.)
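For reference, this is the update scheme from the Adam paper (Kingma & Ba, 2014) as I understand it, with the raw moments mt_1/vt_1 carried between iterations and bias-corrected copies used only for the step (names mirror my commented-out block above):

    import numpy

    def adam_step(theta,dj,mt_1,vt_1,t,lr=0.001,
            beta1=0.9,beta2=0.999,epsilon=1e-8):
        # raw moments are carried over between iterations...
        mt=beta1*mt_1+(1-beta1)*dj
        vt=beta2*vt_1+(1-beta2)*dj**2
        # ...and the bias correction goes into temporary copies
        # that are used only for this step
        mhat=mt/(1-beta1**t)
        vhat=vt/(1-beta2**t)
        theta=theta-lr*mhat/(numpy.sqrt(vhat)+epsilon)
        return theta,mt,vt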
Tons of thanks in advance!