I am quite new to LIBLINEAR/LIBSVM and I have run into a problem.
I have very big training data (2,883,584 samples, highly unbalanced, each of them 21-dimensional) and also big testing data (262,144 samples, also 21-dimensional). I am using the linear kernel implementation of LIBSVM (or LIBLINEAR) because of the size of my data; the literature warns about the issues of using RBF kernels with data like this.
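For reference, this is essentially how I train and predict with the MATLAB interface (a simplified sketch; inst and test_inst are sparse instance matrices, and -s 1 corresponds to the L2R_L2LOSS_SVC_DUAL solver that shows up in the model file below):

% train a linear classifier (L2-regularized L2-loss SVC, dual)
model = train(label, inst, '-s 1 -c 1');
% predict on the test set; the second output is the accuracy
[predicted_label, accuracy, dec_values] = predict(test_label, test_inst, model);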
My problem is: no matter what I do, the classifier only predicts one class (the majority class, i.e. the negative class in my experiments).
What I have tried so far:
1- Training on balanced and on imbalanced data, with no scaling and no parameter selection.
2- Training on balanced and on imbalanced data, with the data scaled to different ranges ([-1,1] and [0,1]) but no parameter selection (see the scaling sketch after this list).
3- Training on balanced and on imbalanced data, with the data scaled to different ranges ([-1,1] and [0,1]) and with parameter selection.
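This is roughly how the [0,1] scaling is done (a simplified sketch; the per-feature min/max come from the training set only and are reused for the test set):

X  = full(inst);                       % only 21 features, so a dense copy is manageable
Xt = full(test_inst);
mins   = min(X, [], 1);                % per-feature min/max from the TRAINING data only
ranges = max(X, [], 1) - mins;
ranges(ranges == 0) = 1;               % avoid dividing by zero on constant features
inst_scaled      = sparse(bsxfun(@rdivide, bsxfun(@minus, X,  mins), ranges));
test_inst_scaled = sparse(bsxfun(@rdivide, bsxfun(@minus, Xt, mins), ranges));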
All of these experiments give 81% accuracy, but all of those correct predictions come from the negative class; every positive sample is misclassified by the linear SVM.
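One variant that is not in the experiments above is keeping the full imbalanced set and using LIBLINEAR's per-class -wi penalties instead of resampling; a sketch of what that call would look like (the weight value is only illustrative):

% give the rare positive class a larger penalty, e.g. the imbalance ratio
ratio = sum(label == -1) / sum(label == 1);
cmd   = sprintf('-s 1 -c 1 -w1 %g -w-1 1', ratio);
model_weighted = train(label, inst, cmd);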
The .model file looks very strange, as you can see below:
solver_type L2R_L2LOSS_SVC_DUAL
nr_class 2
label 1 -1
nr_feature 21
bias -1
w
0
0
nan
nan
0
0
0
0
0
nan
nan
0
0
0
0
0
nan
nan
0
0
0
When I do the parameter selection via grid search, the best C always gives me a 5-fold cross-validation accuracy of 50%. This is how I do the grid search in MATLAB:
bestcv = 0; bestc = 1;                     % initialize before the loop, otherwise the first comparison fails
for log2c = 1:100
    cmd = ['-v 5 -c ', num2str(2^log2c)];  % 5-fold cross-validation with C = 2^log2c
    cv = train(label, inst, cmd);          % with -v, train returns the CV accuracy
    if (cv >= bestcv)
        bestcv = cv; bestc = 2^log2c;
    end
    fprintf('%g %g (best c=%g, rate=%g)\n', log2c, cv, bestc, bestcv);
end
EDIT: Here is one positive and one negative sample from my training data:
1 1:4.896000e+01 2:3.374349e+01 3:2.519652e-01 4:1.289031e+00 5:48 6:4.021792e-01 7:136 8:4.069388e+01 9:2.669129e+01 10:-3.017949e-02 11:3.096163e+00 12:36 13:3.322866e-01 14:136 15:4.003704e+01 16:2.168262e+01 17:1.101631e+00 18:3.496498e+00 19:36 20:2.285381e-01 21:136
-1 1:5.040000e+01 2:3.251025e+01 3:2.260981e-01 4:2.523418e+00 5:48 6:4.021792e-01 7:136 8:4.122449e+01 9:2.680350e+01 10:5.681589e-01 11:3.273471e+00 12:36 13:3.322866e-01 14:136 15:4.027160e+01 16:2.245051e+01 17:6.281671e-01 18:2.977574e+00 19:36 20:2.285381e-01 21:136
And here is one positive and one negative sample from my testing data:
1 1:71 2:2.562365e+01 3:3.154359e-01 4:1.728250e+00 5:76 6:0 7:121 8:7.067857e+01 9:3.185273e+01 10:-8.272995e-01 11:2.193058e+00 12:74 13:0 14:121 15:6.675556e+01 16:3.624485e+01 17:-1.863971e-01 18:1.382679e+00 19:76 20:3.533593e-01 21:128
-1 1:5.606667e+01 2:2.480630e+01 3:1.291811e-01 4:1.477127e+00 5:65 6:0 7:76 8:5.610714e+01 9:3.602092e+01 10:-9.018124e-01 11:2.236301e+00 12:67 13:4.912373e-01 14:128 15:5.886667e+01 16:3.891050e+01 17:-5.167622e-01 18:1.527146e+00 19:69 20:3.533593e-01 21:128
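(The files are in LIBSVM format; loading them and doing a basic sanity check would look something like this:)

[label, inst]           = libsvmread('train.data');   % hypothetical file names
[test_label, test_inst] = libsvmread('test.data');
fprintf('positives: %d, negatives: %d\n', sum(label == 1), sum(label == -1));
fprintf('non-finite feature values: %d\n', nnz(~isfinite(nonzeros(inst))));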
Is there something wrong with my data? Should I increase the C range in the grid search? Or should I use another classifier?