
I recently developed a scikit-learn estimator (a classifier) and now want to add sample_weight support to it. The reason is so that I can apply boosting (e.g. AdaBoost) to it, since AdaBoost requires the base estimator to support sample_weight.

I had a look at a few scikit-learn estimators, such as linear regression, logistic regression and SVM, but each seems to handle sample_weight in a different way, and it's not very clear to me:

Linear regression: https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/linear_model/_base.py#L375

Logistic regression: https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/linear_model/_logistic.py#L1459

SVM: https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/svm/_base.py#L796

So I am confused: how do I add sample_weight to my estimator? Is there a standard way of doing this in scikit-learn, or does it depend on the estimator? Any templates or examples would be much appreciated. Many thanks in advance.
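To make the question concrete, here is a minimal sketch of the kind of template I have in mind. It assumes the scikit-learn convention of an optional `sample_weight` argument in `fit`, and uses a toy nearest-centroid rule where the weights enter through a weighted mean. The class name `WeightedCentroidClassifier` and the centroid rule are purely illustrative (hypothetical), not my actual estimator, and how the weights should enter a real estimator likely depends on its internals.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class WeightedCentroidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts the class of the nearest weighted centroid."""

    def fit(self, X, y, sample_weight=None):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        if sample_weight is None:
            # No weights given: every sample counts once.
            sample_weight = np.ones(X.shape[0])
        sample_weight = np.asarray(sample_weight, dtype=float)
        if sample_weight.shape != (X.shape[0],):
            raise ValueError("sample_weight must have one entry per row of X")

        self.classes_ = np.unique(y)
        # Weighted mean of the rows belonging to each class
        # (assumes the weights within each class do not sum to zero).
        self.centroids_ = np.vstack([
            np.average(X[y == c], axis=0, weights=sample_weight[y == c])
            for c in self.classes_
        ])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every row to every centroid.
        d2 = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d2.argmin(axis=1)]


# Usage: the first sample is weighted three times as heavily as the others.
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
y_demo = np.array([0, 0, 1, 1])
clf = WeightedCentroidClassifier().fit(X_demo, y_demo, sample_weight=[3, 1, 1, 1])
```

In this sketch the weights never touch X or y directly; they only change how much each row contributes to the fitted quantities.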

Leockl
  • Isn't it just a list of values, e.g. [0]*nrow(x)? I can post it as an answer ... – Areza May 12 '20 at 08:00
  • Thanks @user702846. Am I right to say this list of sample_weight values is to be multiplied with the feature matrix X and target vector y in the `.fit(X, y, sample_weight)` method? – Leockl May 12 '20 at 09:32
  • As its name suggests, you put a weight on each sample, so your list's length must be the same as nrow(x) (the number of samples). [2]*nrow(x) produces a list of length nrow(x) where every value is 2 :) – Areza May 12 '20 at 09:35
  • Wouldn’t [2]*nrow(x) just be multiplying the values in each row by 2, rather than creating a duplicate 2nd row, which is what sample_weight is supposed to be doing? – Leockl May 12 '20 at 09:46
  • First of all, [2]*nrow(x) is just arbitrary: you make a list of your own based on what you are going to use. Here I just used it as an example to satisfy the parameter! Regarding your Python question: no! It is in brackets [ ], so it is treated as a list rather than a number, and multiplying it repeats the list instead of scaling a number. – Areza May 12 '20 at 09:56
  • Ok, let me check this out and get back to you. This is probably a NumPy array feature, something like vectorisation/broadcasting. – Leockl May 12 '20 at 10:13
  • I tried this `np.array([2,2,1,1])*X` and it doesn't work. It just multiplies each row in X by each of the values in the first array, i.e. 2 multiplied by the values in the 1st row, 2 multiplied by the values in the 2nd row, 1 multiplied by the values in the 3rd row and 1 multiplied by the values in the 4th row (X here is a feature matrix with 4 rows for this example). – Leockl May 12 '20 at 10:43
  • Sorry, I tried `np.array([2,2,1,1]).reshape(-1,1)*X` rather than `np.array([2,2,1,1])*X`. `np.array([2,2,1,1])*X` doesn't work because of mismatch in array shapes/sizes between the 2 arrays – Leockl May 12 '20 at 10:50
  • `np.array([2,2,1,1])*X` is absolutely NOT the right way to do this. As you mention in your question, the approaches to using sample_weight vary a lot; it depends on the internal implementation details, and often there is more than one way to do it. So I recommend sharing those internal details of your estimator. – Shihab Shahriar Khan May 12 '20 at 10:51
  • @Leockl - why don't you just use [2,2,1,1] then? Again, [2]*nrow was arbitrary. I am sorry for using a Pythonic expression. – Areza May 12 '20 at 11:27
  • Hi @Shihab Shahriar Khan, I think you replied to one of my questions before (https://stackoverflow.com/questions/61556043/how-to-write-a-scikit-learn-estimator-in-pytorch) but I didn't get any further replies from you. Anyhow, it's the same estimator: github.com/leockl/helstrom-quantum-centroid-classifier – Leockl May 12 '20 at 11:34
  • @user702846, I tried [2,2,1,1]*X and it doesn't work; it fails with an error about mismatched array shapes/sizes – Leockl May 12 '20 at 11:35
  • Sorry for non-reply, I remember trying to understand that, but it was/is out of my depth – Shihab Shahriar Khan May 12 '20 at 11:52
  • @Leockl - just try [2,2,1,1], without any *X. – Areza May 12 '20 at 12:08
  • That's ok @Shihab Shahriar Khan, thanks for giving it a go – Leockl May 12 '20 at 12:21
  • @user702846, I don't get it. If I just have a list [2,2,1,1], which is the `sample_weight` for each row in the feature matrix X, how would I then use this list to turn the feature matrix X into one with duplicate rows? – Leockl May 12 '20 at 12:24
  • @Leockl thanks for the downvote - I don't think in your question you are mentioning you need to duplicate your rows ... do you ? – Areza May 12 '20 at 13:18
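
The two operations confused in the comments above can be compared side by side. This is a small sketch with made-up values for `X` and `w`: broadcasting with `reshape(-1, 1)` scales the values in each row, while `np.repeat` duplicates rows. It also shows that, for integer weights, a weighted mean over the original rows matches the plain mean over the duplicated rows, which is one way of reading sample_weight as a repeat count. Whether that interpretation fits a particular estimator is an assumption that depends on its internals.

```python
import numpy as np

# Illustrative data: 4 samples, 2 features, one integer weight per sample.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
w = np.array([2, 2, 1, 1])

# Broadcasting scales the values in each row; the shape stays (4, 2).
scaled = w.reshape(-1, 1) * X

# np.repeat duplicates row i exactly w[i] times; the shape becomes (6, 2).
repeated = np.repeat(X, w, axis=0)

# For integer weights, the weighted mean over the original rows equals
# the plain mean over the repeated rows.
weighted_mean = np.average(X, axis=0, weights=w)
repeated_mean = repeated.mean(axis=0)
```

So the weights do not have to be multiplied into X at all: they can enter the statistics that the estimator computes from X instead.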

0 Answers