
I am quite new to LIBLINEAR/LIBSVM and I have run into a problem here.

I have very big training data (2,883,584 samples, highly unbalanced, each of them 21-dimensional) and also big testing data (262,144 samples, also 21-dimensional). I am using the linear kernel implementation of LIBSVM (or LIBLINEAR) because of the size of my data; the literature warns about the issues of using RBF kernels on data like this.

My problem is: no matter what I do, the classifier only predicts one class (the majority class, which is the negative class in my experiments).

So far I have tried:

1- Training on balanced and unbalanced data, with no scaling and no parameter selection.

2- Training on balanced and unbalanced data, scaling the data to different ranges ([-1,1] and [0,1]; see the scaling sketch below), but with no parameter selection.

3- Training on balanced and unbalanced data, scaling the data to different ranges ([-1,1] and [0,1]), with parameter selection.

All of these experiments result in 81% accuracy, but all of the correct predictions come from the negative class; every positive sample is misclassified by the linear SVM.
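
For reference, the [0,1] scaling in experiments 2 and 3 is just a per-feature min-max rescaling. Below is a simplified Matlab sketch of that step, equivalent to what svm-scale does (the file name train.txt is only a placeholder):

% Load the training data in LIBSVM format (libsvmread ships with the
% LIBSVM/LIBLINEAR Matlab packages).
[label, inst] = libsvmread('train.txt');
inst = full(inst);

% Per-feature min-max rescaling to [0,1], like svm-scale -l 0 -u 1.
lo = min(inst, [], 1);
hi = max(inst, [], 1);
featRange = hi - lo;
featRange(featRange == 0) = 1;   % avoid division by zero for constant features
instScaled = bsxfun(@rdivide, bsxfun(@minus, inst, lo), featRange);

% The same lo/hi from the training set must be reused to scale the test set.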

The .model file is very weird, as you can see below:

solver_type L2R_L2LOSS_SVC_DUAL
nr_class 2
label 1 -1
nr_feature 21
bias -1
w
0 
0 
nan 
nan 
0 
0 
0 
0 
0 
nan 
nan 
0 
0 
0 
0 
0 
nan 
nan 
0 
0 
0 

When I do the parameter selection via grid search, the best C always gives a 5-fold cross-validation accuracy of 50%. This is how I do the grid search in Matlab:

bestcv = 0; bestc = 1;                 % initialize before the search
for log2c = 1:100,
    cmd = ['-v 5 -c ', num2str(2^log2c)];
    cv = train(label, inst, cmd);
    if (cv >= bestcv),
        bestcv = cv; bestc = 2^log2c;
    end
    fprintf('%g %g (best c=%g, rate=%g)\n', log2c, cv, bestc, bestcv);
end

EDIT: Here are one positive and one negative sample from my training data:

 1  1:4.896000e+01 2:3.374349e+01 3:2.519652e-01 4:1.289031e+00 5:48 6:4.021792e-01 7:136 8:4.069388e+01 9:2.669129e+01 10:-3.017949e-02 11:3.096163e+00 12:36 13:3.322866e-01 14:136 15:4.003704e+01 16:2.168262e+01 17:1.101631e+00 18:3.496498e+00 19:36 20:2.285381e-01 21:136 
-1  1:5.040000e+01 2:3.251025e+01 3:2.260981e-01 4:2.523418e+00 5:48 6:4.021792e-01 7:136 8:4.122449e+01 9:2.680350e+01 10:5.681589e-01 11:3.273471e+00 12:36 13:3.322866e-01 14:136 15:4.027160e+01 16:2.245051e+01 17:6.281671e-01 18:2.977574e+00 19:36 20:2.285381e-01 21:136 

And here are one positive and one negative sample from my testing data:

 1  1:71 2:2.562365e+01 3:3.154359e-01 4:1.728250e+00 5:76 6:0 7:121 8:7.067857e+01 9:3.185273e+01 10:-8.272995e-01 11:2.193058e+00 12:74 13:0 14:121 15:6.675556e+01 16:3.624485e+01 17:-1.863971e-01 18:1.382679e+00 19:76 20:3.533593e-01 21:128 
-1  1:5.606667e+01 2:2.480630e+01 3:1.291811e-01 4:1.477127e+00 5:65 6:0 7:76 8:5.610714e+01 9:3.602092e+01 10:-9.018124e-01 11:2.236301e+00 12:67 13:4.912373e-01 14:128 15:5.886667e+01 16:3.891050e+01 17:-5.167622e-01 18:1.527146e+00 19:69 20:3.533593e-01 21:128 

Is there something wrong with my data? Should I increase the C range in the grid search? Or should I use another classifier?

mad
  • Have you tried other models (maybe other kernels, such as RBF) to make sure it is not a problem with the formatting of the data? – Ray Dec 17 '13 at 12:56
  • @Ray: Look at my edit where I show some of my data. I think it is in LIBSVM/LIBLINEAR format, isn't it? I will try your suggestion about other kernels anyway. Thanks for your reply. – mad Dec 17 '13 at 13:16
  • I am not sure since I am not really familiar with libsvm. Looks nice to me. :) – Ray Dec 17 '13 at 13:17
  • Some features in your dataset have values an order of magnitude larger than others (e.g. #20 vs. #21), and that will give you numerical trouble. Have you tried normalizing your data? Also, unbalanced data is a problem on its own; I would suggest you try a small balanced subset of your data first, and once you have a stable model, go from there and add the whole dataset. – Pedrom Dec 17 '13 at 13:50
  • @Pedrom: I applied LIBSVM's svm-scale script, which scales the data to a given interval (I tried [0,1] and [-1,1]). What do you mean when you ask me to normalize the data? I also balanced the data by using the same number of positive and negative samples in training, but the problem persists. Can you explain your suggestions in more detail? Thanks. – mad Dec 17 '13 at 13:58
  • 1
  • @mad Well, the data you posted is not scaled, so I am assuming you did not post it after running svm-scale. Can you post that so we can give you better insights? From the model you posted, as you already guessed, one can conclude that the algorithm is not learning, and it looks like it is because of numerical problems in your dataset. – Pedrom Dec 17 '13 at 15:07

1 Answer


In the unbalanced case, the costs of false-positive and false-negative errors are not the same, so the penalties for the positive and negative classes should be different. You may need to choose a weight, C+ and C-, for each class. If you have more negative patterns than positive patterns, then you probably want to make C+ larger than C-, for example:

model = svmtrain(trainLabels, trainFeatures, '-h 0 -b 1 -s 0 -t 0 -c 10 -w1 C+ -w-1 C-');

Usually C+ * N+ = C- * N-, where N+ and N- are the numbers of positive and negative samples respectively.
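
As a concrete illustration (a sketch only; the variable names Nplus, Nminus, Cplus and Cminus are placeholders), the weights can be derived from the class counts and passed to svmtrain like this:

% Count the samples in each class (labels are assumed to be +1 / -1).
Nplus  = sum(trainLabels ==  1);
Nminus = sum(trainLabels == -1);

% Choose the weights so that Cplus * Nplus = Cminus * Nminus,
% keeping the weight of the majority (negative) class at 1.
Cminus = 1;
Cplus  = Cminus * Nminus / Nplus;

cmd   = sprintf('-h 0 -b 1 -s 0 -t 0 -c 10 -w1 %g -w-1 %g', Cplus, Cminus);
model = svmtrain(trainLabels, trainFeatures, cmd);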

Also make sure you choose the correct options. In your case, where the number of training samples is much larger than the number of features, a linear kernel is the best option, as you said in your post.
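
Since you are actually using LIBLINEAR rather than kernel LIBSVM, the same class weighting can be passed to LIBLINEAR's train function as well. A rough sketch, assuming the Matlab interface and the Cplus/Cminus values from above:

% LIBLINEAR's Matlab interface expects a sparse (double) feature matrix.
% -s 1 selects L2-regularized L2-loss SVC (dual), the solver in your model file.
cmd   = sprintf('-s 1 -c 10 -w1 %g -w-1 %g', Cplus, Cminus);
model = train(trainLabels, sparse(trainFeatures), cmd);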

lennon310
  • I disagree with the last bit. If the dimension of the input space is small and the data set is large, a non-linear kernel may be worth trying. – Dthal Dec 19 '13 at 06:17
  • Thank you Dthal. The logistic regression may be better in this case. Non-linear kernel may be best fit if the feature is small(10^0 - 10^3) while the data set is intermediate (10^1 - 10^4). – lennon310 Dec 19 '13 at 12:55