
I am porting some command-line tool invocations to a reusable Python script, but I can't seem to get my head around the Python implementation of LibSVM. These are the CLI commands (Linux), with the LibSVM library installed:

svm-scale -l 0 -u 1 -r models/feats.range input.feats > feats.scaled
svm-predict feats.scaled models/feats.model feats.pred

An example input file (input.feats), where UNK is the label to predict, is shown below. (Side note: I have found that when testing this on Windows, 'UNK' as an arbitrary label value was not allowed and an integer had to be passed, or so libsvm/tools/checkdata.py tells me. I don't understand why, though; on Linux there is no issue at all.)

UNK 1:4.458333333333333 2:24.0 3:0.20833333333333334 4:8.333333333333334 5:29.166666666666668 6:87.5 7:0.5 8:0.5 9:0.16666666666666666 10:0.16666666666666666 11:4.0 12:4.0 13:0.19047619047619047 14:0.041666666666666664 15:0.041666666666666664 16:1.0 17:1.0 18:0.047619047619047616 19:0.2916666666666667 20:0.25 21:7.0 22:6.0 23:0.2857142857142857 29:0.125 30:0.041666666666666664 31:3.0 32:1.0 33:0.047619047619047616

The first problem is that I cannot find a way to implement svm-scale with a lower (-l) and upper (-u) bound, a saved parameters file (-r), and an input file. The Python branch of the official LibSVM implementation is meager, and I am not the only one with this question. This answer suggests using sklearn.preprocessing, and even though that would work for simple (-1, 1) or (0, 1) scaling, I want to scale based on previously saved parameters, as the -r (restore) argument of svm-scale's CLI allows. I have not yet found a solution to this. How can I scale my data with previously saved parameters? An example of such a parameters file, feats.range, looks like this:

x
0 1
1 3.88936170212766 6.346938775510204
2 7.34375 32.625
3 0.1188118811881188 0.4538461538461538
4 3.3003300330033 34.61538461538461
5 18.13471502590674 67.34693877551021
6 43.38235294117647 78.46153846153847
7 0.4794117647058824 0.7286821705426356
8 0.2713178294573644 0.5205882352941176
9 0.1808873720136519 0.5045045045045045
10 0.1148936170212766 0.4144144144144144
11 1.875 12.83333333333333
12 0.84375 10.33333333333333
13 0.217948717948718 0.6125
14 0.02006688963210702 0.1769230769230769
15 0.02006688963210702 0.1538461538461539
16 0.1875 4
17 0.15625 2.857142857142857
18 0.04477611940298507 0.2264150943396226
19 0.0796812749003984 0.2603550295857988
20 0.04682274247491638 0.2
21 1.5 5.777777777777778
22 0.8125 5
23 0.08490566037735849 0.3459119496855346
24 0 0.101010101010101
25 0 0.08856088560885608
26 0 2.444444444444445
27 0 2.25
28 0 0.1437125748502994
29 0.06825938566552901 0.1923076923076923
30 0.03105590062111801 0.1203703703703704
31 0.59375 5.5
32 0.3125 3
33 0.05220883534136546 0.1857142857142857
34 0 0.01818181818181818
35 0 0.5833333333333334
36 0 0.01558441558441558
37 0 0.5
38 0 0.01481481481481482
39 0 0.25
42 0 0.007281553398058253
43 0 0.1818181818181818

Even if that were a success, I am not entirely sure how to proceed when loading a model and predicting the label. Would the following be correct? (Adapted from an example here.)

from libsvm.svm import *
from libsvm.svmutil import *

model = svm_load_model('models/feats.model')
# Given the scaled features 
pred = svm.libsvm.predict(feats_scaled, model)
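
For comparison, the svmutil module bundled with LibSVM exposes svm_predict(y, x, model) rather than a libsvm.predict call; a minimal sketch, using a hypothetical scaled feature vector in LibSVM's sparse dict format and dummy labels since the real ones are unknown:

from libsvm.svmutil import svm_load_model, svm_predict

# Hypothetical scaled feature vector ({feature_index: value} dicts):
feats_scaled = [{1: 0.510292, 2: 0.192089, 3: 0.513893}]

model = svm_load_model('models/feats.model')
# svm_predict expects a list of labels (dummy zeros here, since they
# are unknown), the feature vectors, and the loaded model:
pred_labels, pred_acc, pred_vals = svm_predict([0] * len(feats_scaled), feats_scaled, model)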

An example of the model feats.model is given below.

svm_type epsilon_svr
kernel_type rbf
gamma 0.5
nr_class 2
total_sv 97
rho -0.333511
probA 0.161783
SV
0.003704278553649198 1:0.510292 2:0.192089 3:0.513893 4:0.548984 5:0.614196 6:0.422312 7:0.756692 8:0.243308 9:0.314877 10:0.286878 11:0.143726 12:0.169265 13:0.322193 14:0.446887 15:0.340164 16:0.239344 17:0.238347 18:0.373818 19:0.579746 20:0.336459 21:0.175325 22:0.0925373 23:0.322247 24:0.446311 25:0.416496 26:0.225 27:0.2 28:0.441021 29:0.341773 30:0.340589 31:0.15414 32:0.162791 33:0.400171 
-0.5 1:0.239529 2:0.509408 3:0.194768 4:0.140251 5:0.323725 6:0.282628 7:0.501399 8:0.498601 9:0.264498 10:0.249288 11:0.320659 12:0.315038 13:0.349058 14:0.485075 15:0.466071 16:0.460838 17:0.559229 18:0.604843 19:0.684191 20:0.591079 21:0.61039 22:0.46932 23:0.662154 24:0.353571 25:0.310211 26:0.295455 27:0.246914 28:0.358677 29:0.512774 30:0.236713 31:0.422505 32:0.276486 33:0.342528 34:0.302198 35:0.190476 38:0.370879 39:0.444444 
0.1394560286546107 1:0.155169 2:0.24953 3:0.139667 4:0.220046 5:0.162154 6:0.161413 7:0.42755 8:0.57245 9:0.169278 10:0.116145 11:0.1225 12:0.126426 13:0.221127 14:0.460867 15:0.540366 16:0.28154 17:0.408983 18:0.790225 19:0.581336 20:0.4428 21:0.238848 22:0.179753 23:0.570333 24:0.599045 25:0.575372 26:0.337945 27:0.309179 28:0.722944 29:0.219931 30:0.187145 31:0.144835 32:0.12639 33:0.338516 
-0.2177410947329247 1:0.407746 2:0.77446 3:0.370246 4:0.478536 5:0.363025 6:0.424849 7:0.392069 8:0.607931 9:0.306265 10:0.369993 11:0.516818 12:0.551465 13:0.429111 14:0.546025 15:0.533429 16:0.697352 17:0.853528 18:0.617085 19:0.381296 20:0.421772 21:0.584416 22:0.522388 23:0.407158 24:0.282857 25:0.322619 26:0.314685 27:0.34188 28:0.341095 29:0.831686 30:0.388049 31:0.819696 32:0.542039 33:0.453437 
0.07172083700103118 1:0.343284 2:0.567237 3:0.355343 4:0.320906 5:0.317058 6:0.486173 7:0.462343 8:0.537657 9:0.146061 10:0.297108 11:0.280368 12:0.376972 13:0.302624 14:0.552922 15:0.575728 16:0.55824 17:0.721618 18:0.637894 19:0.405324 20:0.407375 21:0.42447 22:0.371563 23:0.367107 24:0.913107 25:0.90443 26:0.818182 27:0.77193 28:0.922189 29:0.291093 30:0.00557018 31:0.340261 32:0.138311 34:0.400485 35:0.270677 42:1 43:0.868421 
-0.3094983056964442 1:0.436071 2:0.978574 3:0.498165 4:0.433746 5:0.249021 6:0.155329 7:0.264944 8:0.735056 9:0.0751117 10:0.00664228 11:0.429658 12:0.306257 13:0.0542717 14:0.534434 15:0.432468 16:0.825137 17:0.867769 18:0.632014 19:0.536559 20:0.440424 21:0.974026 22:0.681592 23:0.571392 24:0.54 25:0.58658 26:0.715909 27:0.740741 28:0.740248 29:0.643238 30:0.146672 31:0.847134 32:0.410853 33:0.286256 34:1 35:1 36:1 37:1 42:0.35671 43:0.458333 
-0.2127153117962579 1:0.29474 2:0.686222 3:0.277662 4:0.372119 5:0.0999395 6:0.104283 7:0.313798 8:0.686202 9:0.259288 10:0.13645 11:0.425563 12:0.316389 13:0.28685 14:0.269281 15:0.269159 16:0.354351 17:0.4548 18:0.409764 19:0.610769 20:0.548504 21:0.746254 22:0.577497 23:0.740365 24:0.339252 25:0.316589 26:0.346154 27:0.307692 28:0.414735 29:0.127795 30:0.0708427 31:0.302303 32:0.227191 33:0.204197 34:0.17134 35:0.131868 38:0.21028 39:0.307692 
-0.2496431871080472 1:0.510205 2:0.553358 3:0.461522 4:0.47673 5:0.435839 6:0.507867 7:0.625846 8:0.374154 9:0.285986 10:0.372825 11:0.361217 12:0.420417 13:0.385917 14:0.536158 15:0.550781 16:0.533698 17:0.682645 18:0.596873 19:0.539105 20:0.527372 21:0.532468 22:0.456053 23:0.473571 24:0.721875 25:0.676324 26:0.636364 27:0.567901 28:0.681028 29:0.352442 30:0.14796 31:0.365888 32:0.235142 33:0.150792 34:0.572917 35:0.380952 36:0.501302 37:0.333333 38:0.175781 39:0.222222 
-0.5 1:0.39885 2:0.558632 3:0.433279 4:0.549147 5:0.451879 6:0.436535 7:0.618319 8:0.381681 9:0.371905 10:0.38368 11:0.419011 12:0.430955 13:0.439957 14:0.40664 15:0.3375 16:0.422951 17:0.460496 18:0.365202 19:0.64188 20:0.667498 21:0.631169 22:0.570149 23:0.647734 24:0.215217 25:0.245471 26:0.190909 27:0.207407 28:0.257716 29:0.351007 31:0.368153 32:0.131783 33:0.00525235 
-0.128526024769383 1:0.370543 2:0.830243 3:0.374009 4:0.420573 5:0.336729 6:0.339568 7:0.566359 8:0.433641 9:0.422597 10:0.32342 11:0.65019 12:0.543359 13:0.418273 14:0.322087 15:0.311691 16:0.47541 17:0.590083 18:0.368456 19:0.535713 20:0.635179 21:0.818182 22:0.781095 23:0.673289 24:0.378529 25:0.398529 26:0.443182 27:0.444444 28:0.444149 29:0.706362 30:0.409689 31:0.779193 32:0.596899 33:0.525309 34:0.323529 35:0.285714 38:0.198529 39:0.333333 42:0.403922 43:0.458333 
-0.5 1:0.336929 2:0.619283 3:0.402381 4:0.461545 5:0.294114 6:0.560478 7:0.475044 8:0.524956 9:0.381502 10:0.487365 11:0.467681 12:0.543359 13:0.496372 14:0.241649 15:0.202083 16:0.300546 17:0.343251 18:0.164813 19:0.621826 20:0.616812 21:0.681818 22:0.58209 23:0.533444 24:0.394565 25:0.327295 26:0.375 27:0.296296 28:0.319923 29:0.413596 30:0.260786 31:0.43949 32:0.348837 33:0.254657 
-0.1284103133236481 1:0.522824 2:0.869798 3:0.583767 4:0.547795 5:0.508961 6:0.307428 7:0.736016 8:0.263984 9:0.459364 10:0.375196 11:0.711027 12:0.613611 13:0.511041 14:0.379192 15:0.246402 16:0.562842 17:0.51809 18:0.292481 19:0.649172 20:0.510371 21:1 22:0.681592 23:0.558851 24:0.525 25:0.384943 26:0.636364 27:0.444444 28:0.437937 29:0.579549 30:0.288445 31:0.716914 32:0.503876 33:0.394638 34:0.416667 35:0.380952 36:0.486111 37:0.444444 
-0.5 1:0.413909 2:0.526046 3:0.450952 4:0.458142 5:0.559618 6:0.469773 7:0.755839 8:0.244161 9:0.649274 10:0.459739 11:0.565454 12:0.460561 13:0.517085 14:0.269144 15:0.289706 16:0.288056 17:0.391736 18:0.294484 19:0.593168 20:0.688264 21:0.55102 22:0.556503 23:0.64914 24:0.0685121 25:0.078143 26:0.0584416 27:0.0634921 28:0.0804432 29:0.509709 30:0.272155 31:0.432211 32:0.30897 33:0.301686 34:0.190311 35:0.122449 38:0.233564 39:0.285714 
0.04821612562570891 1:0.367215 2:0.676818 3:0.433176 4:0.476298 5:0.417112 6:0.437687 7:0.477799 8:0.522201 9:0.405974 10:0.447972 11:0.525752 12:0.552939 13:0.52237 14:0.583064 15:0.322398 16:0.66617 17:0.51435 18:0.345844 19:0.361426 20:0.519472 21:0.478158 22:0.544098 23:0.499161 24:0.220818 25:0.251859 26:0.22314 27:0.242424 28:0.264241 29:0.528581 30:0.193376 31:0.546034 32:0.323467 33:0.225232 
0.5 1:0.401844 2:0.995056 3:0.517845 4:0.385894 5:0.50683 6:1 7:0.66892 8:0.33108 9:0.439376 10:0.64369 11:0.787072 12:0.964874 13:0.441535 14:0.362473 15:0.31 16:0.606557 17:0.682645 18:0.185287 19:0.495642 20:0.69869 21:0.935065 22:1 23:0.425939 24:0.152308 25:0.173718 26:0.204545 27:0.222222 28:0.136438 29:0.441905 30:0.341295 31:0.694268 32:0.627907 33:0.196415 
-0.1337953485189151 1:0.58915 2:0.747076 3:0.590697 4:0.550136 5:0.638569 6:0.435263 7:0.770819 8:0.229181 9:0.637202 10:0.566117 11:0.755484 12:0.697374 13:0.676848 14:0.657292 15:0.617229 16:0.798235 17:0.93897 18:0.716938 19:0.208225 20:0.287813 21:0.368631 22:0.375431 23:0.268554 24:0.319355 25:0.298021 26:0.346154 27:0.307692 28:0.313125 29:0.277149 30:0.111962 31:0.427732 32:0.284436 33:0.133262 34:0.322581 35:0.263736 38:0.395894 39:0.615385 
Bram Vanroy
  • Please do explain the down vote so I can improve my post. – Bram Vanroy Jan 24 '18 at 16:28
  • If I understood correctly, you have a command line that works on *Lnx*, and you want to port it on *Win*. My 1st question is: is there a *Win* cmdline that works as expected (didn't work with *LibSVM* so don't know what platforms are (fully) supported)? The next thing is you want to use the *Python* interface (a workaround like calling the cmdline from *Python* is not accepted (for good reason)). Could you add to the question all required inputs (if possible trim them down to keep them as simple as possible), the expected output (on *Lnx*), and the actual output. – CristiFati Jan 30 '18 at 10:06

3 Answers


I believe you can accomplish the desired behavior with sklearn's sklearn.preprocessing.MinMaxScaler, which can scale each feature into a desired range, as per MinMaxScaler(feature_range=(0, 1), copy=True). This is achieved by setting feature_range. In terms of calculations, the per-feature min and max values are computed during fit and reused during transform.
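
A minimal sketch of that fit/transform split (the toy numbers here are placeholders, not your data):

from sklearn.preprocessing import MinMaxScaler

X_train = [[-1.0, 2.0], [1.0, 18.0]]  # fit learns per-feature min/max from these
X_test = [[0.0, 10.0]]

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)
print(scaler.transform(X_test))  # [[0.5 0.5]] -- scaled with the stored min/max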

svm-scale works similarly: the file you show saves the scaling factors needed to transform the test data. Here, every feature will be transformed into the range set when first calling svm-scale. Scikit's scalers offer more flexibility in that you can scale each feature with a different range, though that may be counter-intuitive.

The transformers of scikit-learn are best used for the whole pipeline: first fitting on the training data, then transforming the test data. You can hack them for the case where you already have the parameters file you indicate, but then it is probably better to write your own scaling function to keep the code clean. For instance, assuming the file feats.range contains:

x
0 1
1 -1 1
2 2 18

the following function will produce the desired output.

import numpy as np

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

def svm_scale(X, path="feats.range", new_range=(0, 1)):
    # Skip the two header lines ("x" and the target range "0 1")
    f = open(path).read().splitlines()[2:]
    f = np.array([x.split() for x in f]).astype(float)
    my_min, my_max = f[:, 1], f[:, 2]  # second/third column: per-feature min/max
    # Note: X needs one column per row of the range file, in the same order,
    # or the broadcast below fails (cf. the shape-mismatch error in the comments).
    X_std = (X - my_min) / (my_max - my_min)  # rescale to (0, 1)
    X_scaled = X_std * (new_range[1] - new_range[0]) + new_range[0]  # then to new_range
    return X_scaled

svm_scale(data)  # returns array([[0., 0.], [0.25, 0.25], [0.5, 0.5], [1., 1.]])
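
To get X from a LibSVM-format file in the first place, scikit-learn's load_svmlight_file (linked in the comments below) can be used; a small sketch, noting that it parses labels as numbers, so a literal UNK label would need replacing first:

from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("input.feats")  # X is a sparse matrix, y the labels
X_scaled = svm_scale(X.toarray())         # densify so the broadcasting above works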
geompalik
  • Could you give an example how the feature range from a file can be used in scaling current feature values? – Bram Vanroy Jan 30 '18 at 17:08
  • Yes. In that case it is probably preferable to write your own function. – geompalik Jan 31 '18 at 09:12
  • Thanks for that. I am going to test it and get back at you. – Bram Vanroy Jan 31 '18 at 10:13
  • I am getting the following error `ValueError: operands could not be broadcast together with shapes (4,2) (41,)` when I try your exact code with my feats.range file (see my OP). Tried 3.6 as well as 2.7. – Bram Vanroy Jan 31 '18 at 10:51
  • Also, why is your data a list of lists? I was assuming that the input data were feature vectors of the type ` – Bram Vanroy Jan 31 '18 at 11:52
  • On how to read LibSVM format in python please check `http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html#sklearn.datasets.load_svmlight_file`. In applying ML you need to (i) Load the data, (ii) transform them (iii) apply a learning algo. – geompalik Jan 31 '18 at 12:49

I'm sorry I am having a little bit of trouble understanding what exactly you are trying to accomplish here.

It would be simpler if the question were "how do I create a model with previously saved feature weighting" OR "how do I resolve this specific Linux/Windows OS difference in this CLI?" But you acknowledged that in a comment.

To paraphrase: if we were using scikit-learn, you already have your "per feature relative scaling of the data"? As in, in your feats.range example, the zeroth column is the feature index, the first column the min, and the second the max? Why not just assign your already-fitted mins and maxes to the scaler's attribute?

from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
scaler.fit(data)  # fit learns per-feature min/max; transform alone would fail

scaler.scale_

returns array([ 0.5 , 0.0625])

but we can just force it to be whatever we want, for example the values from your feats.range

from numpy import array
scaler.scale_ = array([12345, 54321]) #your values go here, I think?

scaler.scale_

returns array([12345, 54321])
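
Note that transform also applies a min_ offset, so forcing scale_ alone is not enough on its own. A fuller sketch, using the toy data's min/max in place of real feats.range values (depending on the scikit-learn version, you may also need to set n_features_in_):

from numpy import array
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
data_min = array([-1.0, 2.0])  # per-feature minima, e.g. read from feats.range
data_max = array([1.0, 18.0])  # per-feature maxima
scaler.scale_ = (scaler.feature_range[1] - scaler.feature_range[0]) / (data_max - data_min)
scaler.min_ = scaler.feature_range[0] - data_min * scaler.scale_
scaler.transform([[0.0, 10.0]])  # returns array([[0.5, 0.5]])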

Apologies if I misunderstood.

Dylan

I do not have a large amount of experience with the Python API for LibSVM, but after a quick look through it, it doesn't seem like the complete command-line interface is exposed to Python. I can't find anything resembling the svm-scale command in the Python API either, so if you really want to do exactly that directly from Python, it doesn't seem like that'd be easy.

However, from your question I get the impression that running the command-line interface is not necessarily a problem, correct? You can install and run it, but would prefer to automate it through python. In this case, I'd suggest an alternative solution: use the command-line interface from python scripts.

For example, if the command you want to run is:

svm-scale -l 0 -u 1 -r models/feats.range input.feats > feats.scaled

I think you could do that with the following python code:

import subprocess
with open("feats.scaled", "w") as out_file:  # capture stdout instead of the shell's ">"
    subprocess.run(["svm-scale", "-l", "0", "-u", "1", "-r", "models/feats.range", "input.feats"], stdout=out_file)

Note: the output redirection with > is shell syntax and cannot be passed as a list element to subprocess; the stdout= argument above writes the command's output to feats.scaled instead (see e.g. Python subprocess command with pipe).

Of course, an additional complication of this solution is that your scaled features would end up in a file, not directly in RAM accessible to python. So, you'll have to write a bit of additional code to read from the file afterwards (or maybe svm_read_problem() works? not sure). This code is probably much easier to write than trying to re-implement the svm-scale functionality in python yourself.
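
On that last point, svm_read_problem() from the bundled Python bindings does read files in this format, as far as I can tell; a minimal sketch:

from libsvm.svmutil import svm_read_problem
# y is the list of labels, x a list of {feature_index: value} dicts:
y, x = svm_read_problem('feats.scaled')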

Dennis Soemers
  • This is not the answer I'm looking for unfortunately. Running a subprocess from Python is something I could do indeed, but I want a Python implementation to ensure that the behaviour is consistent. As I said in my OP, the behaviour on Windows and Linux is NOT consistent (cf. the UNK issue). – Bram Vanroy Jan 27 '18 at 13:26
  • @BramVanroy if the difference between Windows and Linux with respect to the UNK thing is the issue, can you clarify that a bit more? Exactly what commands do you run, and exactly what error do they give you on Windows but not on Linux? The main readme in the libsvm repository says the label has to be an integer, regardless of which OS you're on. In that `checkdata.py` I also don't see anything that would be OS-dependent, I'd expect it to give the same errors on both OSes – Dennis Soemers Jan 27 '18 at 14:10
  • I'm not sure if I should edit this post and turn it around to differences between the linux and windows implementation. I think it's against SO's rules to change the topic this drastically. I'll make a new post soon. – Bram Vanroy Jan 29 '18 at 09:32
  • @BramVanroy I think it should be fine to just edit such a clarification into this existing question; it makes sure other people won't suggest workarounds which don't work out for you (like I did) in the future to this question – Dennis Soemers Jan 29 '18 at 11:06