
I want to use feature extraction in my program, then estimate the optimal weight of each feature and compute the score of a new input record.

For example, I have a paraphrase dataset. Each record in this dataset is a pair of sentences whose similarity is indicated by a value between 0 and 1. After extracting, say, 4 features, I create a new dataset with these feature values and similarity scores. I want to use this new dataset to learn the weights:

Paraphrase dataset:

"A problem was solved by a mathematician"; "A mathematician was found a solution for a problem"; 0.9  
...

New dataset:

0.42; 0.61; 0.21; 0.73; 0.9
...

I want to use regression to estimate the weight of each feature. I then want to compute the similarity of the input sentences in the program with equation 1:

S = W1*F1 + W2*F2 + W3*F3 + W4*F4

I know that regression could be used for this, but I don't know how. Can you guide me through it? Is there any paper or document that uses regression for this?

  • You should reformulate your question: 1. It is not clear what your data looks like: are there many features for each object? What kind of features? Numerical? Categorical? 2. What do you mean by "classification algorithm"? Did you classify your data using some machine learning method, or simply apply labels via simple rules? 3. Your use of the phrase "feature extraction" does not seem correct; what did you mean by "I want to use this feature extraction"? 4. What do you mean by "optimal weight"? Weight in the sense of a weighted mean? Optimal in what sense? Classification accuracy? – lejlot Aug 17 '13 at 09:40
  • Are you looking for the mathematical formulation of regression, or code implementation? If the former, please use sister site [CrossValidated](https://stats.stackexchange.com) – smci May 01 '18 at 02:11

1 Answer


What you are looking for is simple linear regression (which, by the way, is not an algorithm but a data modeling approach; algorithms are used to find the linear regression parameters, but the regression itself is not one). You should also add a bias (intercept) term to your equation, so it becomes:

S = w1*f1 + w2*f2 + w3*f3 + w4*f4 + b

or, in vectorized form,

s = <F,W> + b

where <F,W> is the inner product of your feature and weight vectors, and b is the bias (a real-valued variable).

To unify the notation, you can add a constant feature f5 = 1 and include w5 instead of b, so the model becomes

s = <F,W>
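
For instance, in Python with NumPy, the constant column can be appended like this (a minimal sketch; the feature values are placeholders in the shape of your new dataset):

```python
import numpy as np

# hypothetical feature matrix: one row per sentence pair, one column per feature
features = np.array([[0.42, 0.61, 0.21, 0.73],
                     [0.10, 0.35, 0.80, 0.05]])

# append the constant column f5 = 1, so that w5 plays the role of the bias b
F = np.hstack([features, np.ones((features.shape[0], 1))])
```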

You can solve for the weights with the ordinary least squares (OLS) method:

W = (F'F)^(-1) F' s

which yields the linear regression that is optimal in terms of the sum of squared residuals.
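
A minimal NumPy sketch of that solve, with made-up toy rows for illustration (`np.linalg.lstsq` computes the same least-squares solution as the explicit formula, but without forming the inverse, which is numerically safer):

```python
import numpy as np

# toy data: F already contains the constant column f5 = 1,
# s holds the similarity score of each sentence pair
F = np.array([[0.42, 0.61, 0.21, 0.73, 1.0],
              [0.10, 0.35, 0.80, 0.05, 1.0],
              [0.55, 0.40, 0.33, 0.60, 1.0],
              [0.90, 0.85, 0.10, 0.95, 1.0],
              [0.20, 0.15, 0.50, 0.25, 1.0],
              [0.70, 0.55, 0.25, 0.80, 1.0]])
s = np.array([0.90, 0.30, 0.70, 0.95, 0.40, 0.85])

# W = (F'F)^(-1) F' s, computed via least squares
W, *_ = np.linalg.lstsq(F, s, rcond=None)

# similarity of a new pair: S = <f, W> (note the appended 1 for the bias)
f_new = np.array([0.50, 0.45, 0.30, 0.65, 1.0])
S = f_new @ W
```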

In virtually every programming language you will find libraries for performing linear regression, so you do not have to implement it yourself. In particular, most libraries also take care of the b (intercept) term, so there is no need to introduce it yourself.
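
For example, in Python, scikit-learn's `LinearRegression` fits the intercept for you (a sketch; the arrays are assumed to be built from your new dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# raw features (no constant column needed) and similarity scores
features = np.array([[0.42, 0.61, 0.21, 0.73],
                     [0.10, 0.35, 0.80, 0.05],
                     [0.55, 0.40, 0.33, 0.60],
                     [0.90, 0.85, 0.10, 0.95],
                     [0.20, 0.15, 0.50, 0.25]])
scores = np.array([0.90, 0.30, 0.70, 0.95, 0.40])

reg = LinearRegression(fit_intercept=True)  # the intercept plays the role of b
reg.fit(features, scores)

print(reg.coef_)                                # w1..w4
print(reg.intercept_)                           # b
print(reg.predict([[0.50, 0.45, 0.30, 0.65]]))  # S for a new pair
```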

  • Thanks. But I don't know whether W indicates the influence of each feature or is only a coefficient. And I don't understand why I should include b in the feature set, and what value it should be set to (why 1)? Thanks a lot for your attention. – SahelSoft Aug 17 '13 at 18:01
  • These coefficients can be interpreted as influences (to a limited extent). You need the `b` parameter so you can define any kind of hyperplane; without this parameter all your models (hyperplanes) have to go through the origin (so for the feature values `f1=f2=f3=f4=0` it **has to be** `s=0`; introducing the `b` parameter makes it possible to have `s>0` for `f1=f2=f3=f4=0`). We are not setting `b=1`, but `f5=1`, so we can think of `w5` as `b` - you could choose any non-zero constant, that does not really matter. – lejlot Aug 17 '13 at 18:24
  • OK. I wrote a program in Matlab that executes `W=(F'F)^(-1)F's`. The results indicate that some features have a weight less than 0. Does that mean these features are not important and I should remove them when computing S? – SahelSoft Aug 17 '13 at 18:38
  • No, negative values simply mean that these features are important in reducing the similarity. Consider the 1-d linear function `f(x) = -x`. The best weight for the only dimension is `-1`; it does not mean that `x` is unimportant for `f`, it simply means that it has a "negative" effect on the value of `f`. – lejlot Aug 17 '13 at 18:54
  • I have some features that are penalty scores and should reduce the similarity, but two of them got positive weights and the last one got a negative weight. It is not clear! – SahelSoft Aug 17 '13 at 19:04
  • You have chosen the simplest possible model, so do not expect readily interpretable weights here. Semantic similarity is a complex problem in NLP; you will not get "reasonable" weights from such simple operations. You simply have weights (positive and negative ones) that are optimal in the sense of fitting a hyperplane to your data, that's all. SO is not the place for modeling semantic similarity; you can only get answers to your technical question, and that, to the best of my knowledge, has been clearly addressed. – lejlot Aug 17 '13 at 19:16
  • Could you please just let me know the exact way to implement `W = (F'F)^(-1)F's` in Python? – Jyotirmay Nov 17 '17 at 15:49