In logistic regression, the hypothesis function is
h(x) = ( 1 + exp{-wx} )^-1
where w is the vector of weights/parameters to be fit (optimized), and wx denotes the dot product of w and x.
The log likelihood (whose negative is the cost function) for a single training example (x, y) is:
l(w) = y * log( h(x) ) + (1 - y) * log( 1 - h(x) )
The goal is to maximize l(w), summed over all the training examples, and thereby estimate w.
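To make the setup concrete, here is a minimal NumPy sketch of the hypothesis and the log likelihood as written above (the function names and the clipping constant are mine, added only to avoid log(0); they are not part of the model):

    import numpy as np

    def h(w, x):
        """Sigmoid hypothesis: h(x) = 1 / (1 + exp(-w.x))."""
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

    def log_likelihood(w, X, y):
        """Sum over all examples of y*log(h(x)) + (1-y)*log(1-h(x))."""
        p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)  # clip only to avoid log(0)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))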
Question :
Consider the situation wherein there are many more positive (y=1) training examples than negative (y=0) training examples.
For simplicity:
if we consider only the positive (y = 1) examples, the algorithm effectively runs:
maximize ( l(w) )
=> maximize ( y * log( h(x) ) ); the (1 - y) term vanishes when y = 1
=> maximize ( log( h(x) ) ); since y = 1
=> maximize ( h(x) ); since log(z) increases with z
=> maximize ( ( 1 + exp{-wx} )^-1 )
=> maximize ( wx );
since a larger wx increases h(x) and moves it closer to 1
In other words, the optimization algorithm will try to increase wx so as to better fit the data and increase the likelihood (a small sketch of this is given below).
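An illustrative sketch of this effect (the data, learning rate, and step count are made up, not from the problem above): with only positive examples, the gradient of the summed log likelihood, X.T @ (y - p), has every component of (y - p) positive, so gradient ascent never reaches a stationary point and ||w|| keeps growing:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])  # intercept column + 2 features
    y = np.ones(n)                                             # positive (y = 1) examples only
    w = np.zeros(3)
    lr = 0.1

    for step in range(1, 20001):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p)          # gradient ascent on l(w); the gradient never vanishes here
        if step % 5000 == 0:
            print(step, np.round(np.linalg.norm(w), 3))        # ||w|| keeps increasing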
However, it seems possible that there is an unintended way for the algorithm to increase wx without improving the solution (the decision boundary) in any way:
by scaling w: w' = k*w (where k is a constant greater than 1),
we can increase (k*w)x without changing our solution in any way, since sign(k*wx) = sign(wx) and the boundary wx = 0 stays put.
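Here is a small sketch of that scaling argument with made-up numbers: the (hypothetical) weight vector w already separates the data, so scaling it by k > 1 leaves sign(wx), and hence the decision boundary, unchanged, yet the log likelihood strictly increases toward 0:

    import numpy as np

    X = np.array([[ 2.0,  1.0],   # examples classified positive (wx > 0)
                  [ 1.0,  2.0],
                  [-2.0, -1.0],   # examples classified negative (wx < 0)
                  [-1.0, -2.0]])
    y = np.array([1.0, 1.0, 0.0, 0.0])
    w = np.array([1.0, 1.0])      # a hypothetical weight vector that already separates the data

    def log_likelihood(w, X, y):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    for k in [1, 2, 5, 10]:
        print(k, np.sign(X @ (k * w)), round(log_likelihood(k * w, X, y), 4))
    # the signs (decision boundary) never change, but l(k*w) increases toward 0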
1) Why is this not a problem? Or is it in fact a problem?
2) One can argue that in a dataset with many more positive examples than negative examples, the algorithm will keep trying to increase ||w||. Is that argument correct?