Is there a reason why a feature only present in a given class is not being predicted strongly into that class?

Question

Summary & Questions

I'm using liblinear 2.30. I noticed a similar issue in prod, so I tried to isolate it with a reduced training set: 2 classes, 1 training doc per class, 5 features that all have the same weight in my vocabulary, and 1 simple test doc containing a single feature that is present only in class 2.

a) what's the feature value being used for?
b) Why is this test document, which contains a single feature that is present in only one class, not strongly predicted into that class?
c) I'm not expecting to have different values per feature. Are there any other implications of increasing each feature value from 1 to something else? How can I determine that number?
d) Could my changes affect other more complex trainings in a bad way?

What I tried

Below you will find data related to a simple training (please focus on feature 5):

> cat train.txt
1 1:1 2:1 3:1
2 2:1 4:1 5:1
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 train.txt model.bin
iter  1 act 3.353e-01 pre 3.333e-01 delta 6.715e-01 f 1.386e+00 |g| 1.000e+00 CG   1
iter  2 act 4.825e-05 pre 4.824e-05 delta 6.715e-01 f 1.051e+00 |g| 1.182e-02 CG   1
> cat model.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.3374141436539016
0
0.3374141436539016
-0.3374141436539016
-0.3374141436539016
0

Below you will find my model's prediction:

> cat test.txt
1 5:1
> predict -b 1 test.txt model.bin test.out
Accuracy = 0% (0/1)
> cat test.out
labels 1 2
2 0.416438 0.583562

This is where I'm a bit surprised: the predictions are only [0.42, 0.58], even though feature 5 is present only in class 2. Why? So I tried increasing the feature value in the test doc from 1 to 10:

> cat newtest.txt
1 5:10
> predict -b 1 newtest.txt model.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.0331135 0.966887

Now I get a better prediction, [0.03, 0.97]. So I retrained with all feature values set to 10:

> cat newtrain.txt
1 1:10 2:10 3:10
2 2:10 4:10 5:10
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 newtrain.txt newmodel.bin
iter  1 act 1.104e+00 pre 9.804e-01 delta 2.508e-01 f 1.386e+00 |g| 1.000e+01 CG   1
iter  2 act 1.381e-01 pre 1.140e-01 delta 2.508e-01 f 2.826e-01 |g| 2.272e+00 CG   1
iter  3 act 2.627e-02 pre 2.269e-02 delta 2.508e-01 f 1.445e-01 |g| 6.847e-01 CG   1
iter  4 act 2.121e-03 pre 1.994e-03 delta 2.508e-01 f 1.183e-01 |g| 1.553e-01 CG   1
> cat newmodel.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.19420510395364846
0
0.19420510395364846
-0.19420510395364846
-0.19420510395364846
0
> predict -b 1 newtest.txt newmodel.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.125423 0.874577

Again, the prediction for class 2 was still OK: 0.87.

hychou · Accepted Answer · 2020-02-03T12:20:10.063

a) what's the feature value being used for?

Each instance with n features is treated as a point in an n-dimensional space with a label attached, say +1 or -1 (in your case 1 or 2). A linear classifier (here logistic regression, since you train with -s 0) tries to find the best hyperplane separating the instances into two sets, say SetA and SetB. Roughly speaking, one hyperplane is better than another when SetA contains more of the instances labeled +1 and SetB more of those labeled -1, i.e., when it is more accurate. The best hyperplane is saved as the model. In your case, the decision function is:

f(x)=w^T x

where w is the model, e.g., (0.33741, 0, 0.33741, -0.33741, -0.33741) in your first case.

Probability (for LR) formulation:

prob(x)=1/(1+exp(-y*f(x)))

where y=+1 or -1. See Appendix L of LIBLINEAR paper.
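
As a quick sanity check of the formula, the sketch below recomputes the probabilities by hand from the weights in model.bin. It assumes, as the predict output in the question suggests, that the first label listed in the model (label 1) plays the role of y=+1. The helper names are made up for illustration and are not part of LIBLINEAR.

```python
import math

# Weights copied from model.bin in the question (features 1..5).
w = [0.3374141436539016, 0.0, 0.3374141436539016,
     -0.3374141436539016, -0.3374141436539016]

def decision_value(w, doc):
    """f(x) = w^T x for a sparse document {feature_index: value} (1-based)."""
    return sum(w[i - 1] * v for i, v in doc.items())

def probabilities(w, doc):
    """Return (P(label 1), P(label 2)), assuming label 1 corresponds to y = +1."""
    p1 = 1.0 / (1.0 + math.exp(-decision_value(w, doc)))
    return p1, 1.0 - p1

# The two test documents from the question, then the second training document.
for doc in ({5: 1}, {5: 10}, {2: 1, 4: 1, 5: 1}):
    p1, p2 = probabilities(w, doc)
    print(doc, round(p1, 6), round(p2, 6))
```

If the assumption about label order holds, the first two lines should agree with test.out and newtest.out from the question up to rounding.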

b) Why is this test document, which contains a single feature that is present in only one class, not strongly predicted into that class?

Not only does 1 5:1 get a weak probability such as [0.42, 0.58]: if you predict 2 2:1 4:1 5:1 you get [0.337417, 0.662583], so the solver is not very confident even when the input is identical to a training instance.

The fundamental reason is the value of f(x), which grows with the distance between x and the hyperplane. The model can be 100% confident that x belongs to a class only if that distance is infinitely large (see prob(x)).
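
To put numbers on this, inverting prob(x) gives the size of |f(x)| needed for a target confidence p, namely log(p/(1-p)). A small sketch (the function name is illustrative only):

```python
import math

def required_margin(p):
    """|f(x)| needed so that 1/(1+exp(-|f(x)|)) equals p, for 0.5 <= p < 1."""
    return math.log(p / (1.0 - p))

for p in (0.58, 0.9, 0.99, 0.999999):
    print(p, required_margin(p))
```

With the first model, a single feature of value 1 contributes only about 0.34 to |f(x)|, whereas a confidence of 0.99 needs roughly 4.6, and the required value keeps growing without bound as p approaches 1.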

c) I'm not expecting to have different values per feature. Are there any other implications of increasing each feature value from 1 to something else? How can I determine that number?

TL;DR

Enlarging both the training and the test set is like having a larger penalty parameter C (the -c option). Because a larger C means a stricter penalty on errors, intuitively speaking, the solver is more confident in its predictions.

Enlarging every feature of the training set alone acts somewhat like having a smaller C, in the sense that the learned weights shrink. Specifically, logistic regression solves the following optimization problem for w.

min 0.5 w^T w + C ∑i log(1+exp(−yi w^T xi))

(eq(3) of LIBLINEAR paper)

For most instances, yi w^T xi is positive, so larger xi implies a smaller ∑i log(1+exp(−yi w^T xi)). The effect is therefore somewhat similar to having a smaller C, and a smaller C implies a smaller |w|.

On the other hand, enlarging the test set is the same as having a larger |w|. Therefore, the effect of enlarging both the training and the test set is basically

(1) Having a smaller |w| when training
(2) Then, having a larger |w| when testing

Because the effect is more dramatic in (2) than in (1), overall, enlarging both the training and the test set is like having a larger |w|, or, having a larger C.
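
One way to make the relation between feature scaling and C precise, offered here as a sketch rather than as part of the original argument: substitute v = a*w in the objective. Multiplying every training feature by a while keeping C gives the same objective, up to a constant factor 1/a^2, as training on the original features with penalty C*a^2, and the decision values w^T (a x) and v^T x coincide. The code below only checks this identity numerically on the toy data. It does not run LIBLINEAR, and the function name is made up.

```python
import math

def lr_objective(w, docs, labels, C):
    """0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i)) for dense docs."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for x, y in zip(docs, labels):
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        loss += math.log1p(math.exp(-margin))
    return reg + C * loss

# Toy training set from the question as dense vectors; label 1 -> +1, label 2 -> -1.
docs = [[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]]
labels = [+1, -1]

a, C = 10.0, 1.0
w = [0.2, -0.1, 0.05, -0.3, 0.1]   # arbitrary weights, used only for the check
scaled_docs = [[a * xj for xj in x] for x in docs]
v = [a * wj for wj in w]

lhs = lr_objective(w, scaled_docs, labels, C)           # features * a, penalty C
rhs = lr_objective(v, docs, labels, C * a * a) / a**2   # original features, penalty C*a^2
print(lhs, rhs)
```

Under this reading, retraining with all values set to 10 and C=1 corresponds to C=100 on the original data, evaluated on a test document scaled the same way.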

We can check this on the data set by multiplying every feature by 10^12. With C=1, we get the following model and probabilities:

> cat model.bin.m1e12.c1
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430106024949e-12 
0 
3.0998430106024949e-12 
-3.0998430106024949e-12 
-3.0998430106024949e-12 
0 
> cat test.out.m1e12.c1
labels 1 2
2 0.0431137 0.956886

Next we run on the original data set. With C=10^12, we get the following model and probabilities:

> cat model.bin.m1.c1e12
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430101989314 
0 
3.0998430101989314 
-3.0998430101989314 
-3.0998430101989314 
0 
> cat test.out.m1.c1e12
labels 1 2
2 0.0431137 0.956886

Therefore, because a larger C means a stricter penalty on errors, intuitively the solver is more confident in its predictions.

d) Could my changes affect other more complex trainings in a bad way?

From (c) we know your change is like having a larger C, which results in better training accuracy. But the model will almost surely overfit the training set when C becomes too large. As a result, it cannot tolerate noise in the training set and will have poor test accuracy.

As for finding a good C, a popular way is by cross validation (-v option).
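
For readers unfamiliar with the procedure: k-fold cross validation splits the training instances into k folds, trains on k-1 of them, measures accuracy on the held-out fold, and combines the results over all folds. This is repeated for each candidate C. The sketch below shows only that loop in plain Python. `train_fn` and `predict_fn` are hypothetical callbacks standing in for whatever trainer you use; they are not LIBLINEAR functions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(docs, labels, C, k, train_fn, predict_fn):
    """Fraction of instances predicted correctly while held out, for one C."""
    correct = 0
    for fold in k_fold_indices(len(docs), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(docs)) if i not in held_out]
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx], C)
        correct += sum(predict_fn(model, docs[i]) == labels[i] for i in fold)
    return correct / len(docs)

def best_C(docs, labels, candidates, k, train_fn, predict_fn):
    """Pick the candidate C with the highest cross-validation accuracy."""
    return max(candidates,
               key=lambda C: cross_val_accuracy(docs, labels, C, k,
                                                train_fn, predict_fn))
```

Candidates are usually tried on a logarithmic grid, for example 0.01, 0.1, 1, 10, 100.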

Finally, this may be off-topic, but you may want to look at how you pre-process the text data. It is common (e.g., suggested by the author of liblinear here) to normalize the data instance-wise:

For document classification, our experience indicates that if you normalize each document to unit length, then not only the training time is shorter, but also the performance is better.
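
A minimal sketch of such instance-wise normalization, assuming "unit length" means Euclidean (L2) length and that documents are held as {feature_index: value} dictionaries. Both helper names are made up for illustration.

```python
import math

def normalize_unit_length(doc):
    """Scale a sparse document {feature_index: value} to Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in doc.values()))
    if norm == 0.0:
        return dict(doc)  # leave empty or all-zero documents unchanged
    return {i: v / norm for i, v in doc.items()}

def to_liblinear_line(label, doc):
    """Format one instance in the sparse 'label index:value ...' text format."""
    feats = " ".join(f"{i}:{v:.6g}" for i, v in sorted(doc.items()))
    return f"{label} {feats}"

# The second training document, with feature values of 1 and of 10.
print(to_liblinear_line(2, normalize_unit_length({2: 1, 4: 1, 5: 1})))
print(to_liblinear_line(2, normalize_unit_length({2: 10, 4: 10, 5: 10})))
```

Both calls produce the same line, because normalization removes the overall scale of a document. With this pre-processing, choosing between feature values of 1 and 10 stops mattering, and only C is left to tune.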

Thanks for the answer. Would this method in liblinear-java be the way to find the best C? https://github.com/bwaldvogel/liblinear-java/blob/master/src/main/java/de/bwaldvogel/liblinear/Linear.java#L107 - If so I'm wondering what nr_folds is — Damiox, Feb 06 '20 at 01:33
I'm not familiar with liblinear-java, but in liblinear or libsvm, `nr_folds` is the number of folds in cross validation. You can google for k-folds cross validation. — hychou, Feb 06 '20 at 03:28
