
I am using scikit-learn to perform Ridge Regression with weights on individual samples. This can be done via: estimator.fit(X, y, sample_weight=some_array). Intuitively, I expect that a larger weight means greater relevance for the corresponding sample.

However, I tested the method above on the following 2-D example:

    from sklearn import linear_model
    import numpy
    import matplotlib.pyplot as plt

    #Data
    x = numpy.array([[0], [1], [2]])
    y = numpy.array([[0], [2], [2]])
    sample_weight = numpy.array([1, 1, 1])

    # Ridge regression
    clf = linear_model.Ridge(alpha=0.1)
    clf.fit(x, y, sample_weight=sample_weight)

    # Plot the fitted line over the samples
    # (predict expects a 2-D array; plt.hold is deprecated and unnecessary,
    # since newer matplotlib versions hold the axes by default)
    xp = numpy.linspace(-1, 3)
    yp = clf.predict(xp.reshape(-1, 1))
    plt.plot(xp, yp)
    plt.plot(x, y, 'or')
    plt.show()

I ran this code, then ran it again after doubling the weight of the first sample:

    sample_weight = numpy.array([2, 1, 1])

The resulting line moves away from the sample that has the larger weight. This is counter-intuitive, since I expect the sample with the larger weight to have greater influence on the fit.

Am I using the library wrongly, or is there an error in it?

Marco
  • Have you tried doing the opposite? Maybe the weights are inverted. I've found similar things in the logistic regression class. Try setting it to numpy.array([0.5, 1, 1]). – Alex S Jul 15 '13 at 09:05
  • Thanks, this is what I am planning to do. However, I would like to understand why the weights are inverted. – Marco Jul 15 '13 at 11:26
  • Well, same here. The documentation for a lot of methods in sklearn is discouragingly simple. – Alex S Jul 16 '13 at 13:13

1 Answer


The weights are not inverted. Most likely there was a mistake in your code, or a bug in sklearn that has since been fixed. The code

    from sklearn import linear_model
    import numpy
    import matplotlib.pyplot as plt

    # Data
    x = numpy.array([[0], [1], [2]])
    y = numpy.array([[0], [2], [2]])
    sample_weight1 = numpy.array([1, 1, 1])
    sample_weight2 = numpy.array([2, 1, 1])

    # Ridge regressions with the two weightings
    clf1 = linear_model.Ridge(alpha=0.1).fit(x, y, sample_weight=sample_weight1)
    clf2 = linear_model.Ridge(alpha=0.1).fit(x, y, sample_weight=sample_weight2)

    # Plot both fitted lines over the samples
    plt.scatter(x, y)
    xp = numpy.linspace(-1, 3)
    plt.plot(xp, clf1.predict(xp.reshape(-1, 1)))
    plt.plot(xp, clf2.predict(xp.reshape(-1, 1)))
    plt.legend(['equal weights', 'first obs weights more'])
    plt.title('Increasing weight of the first obs moves the line closer to it');

plots this graph, in which the second line (fitted with the increased first weight) lies closer to the first observation:

[Plot: two fitted lines over the three samples; the line fitted with the doubled first weight passes closer to the first observation]
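
As an extra sanity check, here is a minimal sketch reusing x, y, clf1 and clf2 from the code above (the x_dup, y_dup, clf_dup and r1/r2 names are mine). It relies on the fact that sample_weight scales each sample's squared residual in the loss while the alpha penalty is unaffected, so a weight of 2 should behave exactly like duplicating that sample, and the residual at the up-weighted sample should shrink:

    # Weighting the first sample by 2 should match duplicating it,
    # since sample weights scale the squared residuals in the objective.
    x_dup = numpy.array([[0], [0], [1], [2]])
    y_dup = numpy.array([[0], [0], [2], [2]])
    clf_dup = linear_model.Ridge(alpha=0.1).fit(x_dup, y_dup)
    print(clf_dup.coef_, clf2.coef_)  # expect (nearly) identical slopes

    # The residual at the up-weighted sample (x=0, y=0) should shrink.
    r1 = abs(clf1.predict(numpy.array([[0]]))[0, 0])
    r2 = abs(clf2.predict(numpy.array([[0]]))[0, 0])
    print(r1, r2)  # expect r2 < r1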

David Dale