
I want to learn how to use LIBSVM for regression, and I'm currently stuck on preparing my data. I have this form of data in .csv and .xlsx format, and I want to convert it into the LIBSVM data format.

Current Data

So far, I understand that the data should be in this format so that it can be used in LIBSVM:

LIBSVM format

Based on what I read, for regression, "label" is the target value which can be any real number.

I am doing an electric load prediction study. Can anyone tell me what it is? And finally, how should I organize my columns and rows?

Gabriel Luna

1 Answer


The LIBSVM data format is given by:

<label> <index1>:<value1> <index2>:<value2> ...
...
...

As you can see, this forms a matrix with LineCount rows and (IndexCount + 1) columns; more precisely, a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you specify only a few indices, such as <label> 5:<value> 8:<value>, then only indices 5 and 8 (and of course the label) have a custom value; all other values are set to 0. This is just for notational simplicity and to save space, since datasets can be huge.
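To make the sparse convention concrete, here is a minimal Python sketch that turns a label and a dense list of feature values into one line, skipping the zeros. The helper name `to_libsvm_line` is made up for this illustration and is not part of LIBSVM.

```python
def to_libsvm_line(label, features):
    """Format one row as '<label> <index>:<value> ...', omitting zeros.

    Hypothetical helper for illustration. Indices are 1-based and are
    emitted in ascending order. The 'g' format keeps at most six
    significant digits, which is enough for this example.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:g}"] + pairs)


dense_line = to_libsvm_line(0.65, [21, 1])   # every index has a value
sparse_line = to_libsvm_line(0.72, [25, 0])  # index 2 is zero, so it is dropped
```

Both lines describe rows with two features; the second one is simply shorter because the zero entry is implied.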

For the meaning of the tags, I cite the README file:

<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. <index> is an integer starting from 1, <value> is a real number. The indices must be in an ascending order.

As you can see, the label is the quantity you want to predict. Each index identifies a feature of your data, and the number after the colon is that feature's value. A feature is simply a quantity that correlates with your target value, so that a better prediction can be made.
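The rules quoted from the README (indices start at 1 and must be ascending) can be checked with a small parser. This is only a sketch for the plain feature format described here; the function name `parse_libsvm_line` is an assumption for illustration, not a LIBSVM API.

```python
def parse_libsvm_line(line):
    """Split one line into (label, {index: value}) and check the index rules.

    Hypothetical helper: raises ValueError if an index is below 1 or if
    the indices are not strictly ascending.
    """
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    previous = 0  # the first index must be greater than this, i.e. at least 1
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index <= previous:
            raise ValueError(f"index {index} is below 1 or not ascending")
        features[index] = float(value_text)
        previous = index
    return label, features


label, features = parse_libsvm_line("0.65 1:21 2:1")
```

Running a check like this over a converted file before training is a cheap way to catch a repeated or misnumbered index.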

Totally fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out that the outside temperature from the day before is a good indicator for that, so he selects temperature as a feature with index 1. Important: indices always start at one; zero can sometimes cause strange LIBSVM behaviour. Then he notices, to his surprise, that the day of the week (Monday to Sunday, or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:

<myLoad_Value> 1:<outsideTemperatureFromYesterday_Value> 2:<dayOfTheWeek_Value>

Gabriel Luna (he is Batman at night) now records this data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):

0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...
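If the measurements live in a .csv file, rows like these could be produced with a sketch along the following lines. The column names (`load_kwh`, `temp_yesterday_c`, `day_of_week`) and the function `csv_to_libsvm` are assumptions for illustration; substitute the headers of your own file.

```python
import csv
import io

# Hypothetical CSV layout; the real column names in your file will differ.
CSV_TEXT = """load_kwh,temp_yesterday_c,day_of_week
0.72,25,0
0.65,21,1
0.68,29,2
"""

# The position in this list decides the LIBSVM index: first entry is index 1.
FEATURE_COLUMNS = ["temp_yesterday_c", "day_of_week"]


def csv_to_libsvm(csv_file, label_column, feature_columns):
    """Yield one LIBSVM-formatted line per CSV row."""
    for row in csv.DictReader(csv_file):
        pairs = []
        for index, column in enumerate(feature_columns, start=1):
            value = float(row[column])
            if value != 0:  # zeros may be omitted in the sparse format
                pairs.append(f"{index}:{value:g}")
        yield " ".join([f"{float(row[label_column]):g}"] + pairs)


lines = list(csv_to_libsvm(io.StringIO(CSV_TEXT), "load_kwh", FEATURE_COLUMNS))
```

For a file on disk, pass `open("yourfile.csv", newline="")` instead of the `io.StringIO` object and write each yielded line to the output file. An .xlsx sheet can be exported to .csv first.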

Notice that we could leave out 2:0 because of the sparse matrix format. This would be your training data for a LIBSVM model. We then predict tomorrow's load as follows. You know today's temperature, let us say 23°C, and today is Tuesday, which is 1, so tomorrow is 2. So this is the line, or vector, to use with the model:

0 1:23 2:2

Here, you can set the <label> value arbitrarily. It will be overwritten with the predicted value. I hope this helps.
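The same query line can be built programmatically. This is a minimal sketch using the weekday encoding from the story (Monday = 0 to Sunday = 6); the function name `prediction_line` is hypothetical.

```python
def prediction_line(temperature_today, weekday_today, placeholder_label=0):
    """Build the query row for tomorrow's load.

    Hypothetical helper. Feature 1 is today's temperature (which is
    "yesterday's temperature" from tomorrow's point of view) and
    feature 2 is tomorrow's weekday.
    """
    weekday_tomorrow = (weekday_today + 1) % 7  # Sunday (6) wraps to Monday (0)
    return f"{placeholder_label} 1:{temperature_today} 2:{weekday_tomorrow}"


query = prediction_line(23, 1)  # 23 °C today, and today is Tuesday (1)
```

The modulo step matters only at the end of the week, where Sunday has to roll over to Monday.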

thatguy
  • wow! thanks a lot for this very comprehensive explanation.. i was perfectly clueless about the data format for libsvm but this really helped me understand it.. thank you so much! – Gabriel Luna Nov 08 '16 at 15:35
  • One of the best explanations found on the web. What prominence does libSVM format hold when I try to build a model using SVM? Can't I just scale the data and run it through the algorithm to get a trained model? – Giridhar Karnik Jul 27 '17 at 17:44
  • @thatguy, I think I'm missing something... is the index-feature correspondence kept in a different file? It seems strange to me that only arbitrary indexes are used in the libsvm file... – user5029763 Apr 21 '19 at 16:42