
I am working on a project that uses libsvm, and I am preparing my data for the library. How can I convert a CSV file to LIBSVM-compatible data?

CSV File: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/data/iris.csv

From the frequently asked questions (FAQ):

How to convert other data formats to LIBSVM format?

It depends on your data format. A simple way is to use libsvmwrite in the libsvm matlab/octave interface. Take a CSV (comma-separated values) file in UCI machine learning repository as an example. We download SPECTF.train. Labels are in the first column. The following steps produce a file in the libsvm format.

matlab> SPECTF = csvread('SPECTF.train'); % read a csv file
matlab> labels = SPECTF(:, 1); % labels from the 1st column
matlab> features = SPECTF(:, 2:end); 
matlab> features_sparse = sparse(features); % features must be in a sparse matrix
matlab> libsvmwrite('SPECTFlibsvm.train', labels, features_sparse);
The transformed data are stored in SPECTFlibsvm.train.
Alternatively, you can use convert.c to convert CSV format to libsvm format.

But I don't want to use MATLAB; I use Python.

I also found a solution that uses Java.

Can anyone recommend a way to tackle this problem?

Patrick Weiß
user3378649
  • Are you going to use the `libsvm` executables or the Python binding? – emesday Apr 19 '14 at 13:22
  • If the `libsvm` executables, you need to convert the `csv` to `libsvm` data. If the Python binding, you need to load the `csv` into Python. – emesday Apr 19 '14 at 13:26
  • I am going to use the libsvm executables. I found this one (https://github.com/seamusabshere/vector_embed), and I am now figuring out whether it's helpful. But I want to split the predictors from the target (which is one of the columns). Does this affect anything? – user3378649 Apr 19 '14 at 13:32
  • It seems to treat the first column as the target. You need to modify the code accordingly. It's Ruby code. Do you need a `Python version`? – emesday Apr 19 '14 at 13:38
  • This is my first interaction with libsvm. I just need to know how to separate the predictors (many columns) from the target (one specific column). I'd use this script (https://github.com/zygmuntz/phraug/blob/master/csv2libsvm.py). I would be pleased if you could explain more. – user3378649 Apr 19 '14 at 13:43

2 Answers


You can use csv2libsvm.py to convert the CSV file to libsvm data:

python csv2libsvm.py iris.csv libsvm.data 4 True

where 4 is the index of the target column (counting from 0, so the fifth column of iris.csv) and True means the CSV file has a header row.

Finally, you can get libsvm.data as

0 1:5.1 2:3.5 3:1.4 4:0.2
0 1:4.9 2:3.0 3:1.4 4:0.2
0 1:4.7 2:3.2 3:1.3 4:0.2
0 1:4.6 2:3.1 3:1.5 4:0.2
...

from iris.csv

150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
...
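
The mapping from a CSV row to a libsvm line shown above can be sketched in a few lines of Python. This is only an illustration of the format, not the code of csv2libsvm.py; the function name `row_to_libsvm` is made up for this example, and it assumes every feature value is numeric.

```python
def row_to_libsvm(row, label_index):
    """Convert one CSV row (a list of strings) to a libsvm-format line."""
    fields = list(row)               # copy so the caller's row is untouched
    label = fields.pop(label_index)  # label_index counts from 0
    # Feature indices start at 1; zero-valued features are omitted (sparse format).
    pairs = ["%d:%s" % (i, v)
             for i, v in enumerate(fields, start=1)
             if float(v) != 0.0]
    return " ".join([label] + pairs)


print(row_to_libsvm(["5.1", "3.5", "1.4", "0.2", "0"], 4))
# 0 1:5.1 2:3.5 3:1.4 4:0.2
```

The label moves to the front of the line, and each remaining column becomes an `index:value` pair.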
emesday
  • I have 16 features altogether, and the 16th is the class attribute. I have no headers. How can I convert the CSV using the above script? – nifCody Dec 09 '15 at 06:11
  • I tried with a two-column CSV file and it didn't work. I ran `python3 csv2libsvm.py P0.txt P0.data 2 True` and got `Traceback (most recent call last): File "csv2libsvm.py", line 71, in label = line.pop(label_index) IndexError: pop index out of range ` – Henrique Andrade Dec 21 '19 at 15:41
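
Both comments appear to come down to the label index counting from 0, which is consistent with the traceback (`line.pop(label_index)`) and with the answer using 4 for the fifth column of iris.csv. Under that assumption, a two-column file only has indices 0 and 1, and the 16th column of a 16-column file is index 15. The sketch below illustrates this with a hypothetical helper, `pop_label`, that is not part of csv2libsvm.py.

```python
def pop_label(row, label_index):
    """Remove and return the label at a 0-based column index (hypothetical helper)."""
    if not 0 <= label_index < len(row):
        raise IndexError(
            "label index %d is out of range for a row with %d columns; "
            "valid indices are 0..%d" % (label_index, len(row), len(row) - 1)
        )
    return row.pop(label_index)


two_columns = ["0.37", "1"]
try:
    pop_label(list(two_columns), 2)      # what "... P0.data 2 True" asks for
except IndexError as err:
    print(err)

print(pop_label(list(two_columns), 1))   # the second column is index 1

sixteen_columns = [str(n) for n in range(1, 17)]
print(pop_label(sixteen_columns, 15))    # the 16th column is index 15
```

For a headerless file, the modified script in the answer below treats a missing fourth argument as "no header"; whether the original script does the same is an assumption here.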

csv2libsvm.py does not work with Python 3, and it does not support string targets (label names), so I have modified it slightly. It should now work with Python 3 and with string targets. I am very new to Python, so my code may not follow best practices, but I hope it is good enough to help someone.

#!/usr/bin/env python3

"""
Convert CSV file to libsvm format. Features must be numeric;
labels may be numeric or strings (strings are mapped to integer ids).
Put -1 as label index (argv[3]) if there are no labels in your file.
Expecting no headers. If present, headers can be skipped with argv[4] == 1.

"""

import sys
import csv

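def is_number(text):
    # str.isnumeric() rejects values such as "1.0" or "-1", so try float() instead.
    try:
        float(text)
        return True
    except ValueError:
        return False

def construct_line(label, line, labels_dict):
    new_line = []
    if is_number(label):
        # Numeric label: write it as is, normalizing any form of zero to "0".
        if float(label) == 0.0:
            label = "0"
        new_line.append(label)
    else:
        # String label: assign ids 0, 1, 2, ... in order of first appearance.
        if label not in labels_dict:
            labels_dict[label] = str(len(labels_dict))
        new_line.append(labels_dict[label])

    for i, item in enumerate(line):
        # libsvm format is sparse: empty and zero-valued features are omitted.
        if item == '' or float(item) == 0.0:
            continue
        elif item == 'NaN':
            item = "0.0"
        new_item = "%s:%s" % (i + 1, item)
        new_line.append(new_item)
    new_line = " ".join(new_line)
    new_line += "\n"
    return new_line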

# ---

input_file = sys.argv[1]
try:
    output_file = sys.argv[2]
except IndexError:
    output_file = input_file + ".out"

try:
    label_index = int(sys.argv[3])
except IndexError:
    label_index = 0

try:
    # Any value other than "0", "false" or "" means "skip the header row".
    skip_headers = sys.argv[4].lower() not in ("0", "false", "")
except IndexError:
    skip_headers = False

labels_dict = {}

# The with block closes both files even if a row fails to convert.
with open(input_file, 'rt', newline='') as i, \
        open(output_file, 'w', encoding='utf-8', newline='\n') as o:
    reader = csv.reader(i)

    if skip_headers:
        next(reader)

    for line in reader:
        if label_index == -1:
            label = '1'
        else:
            label = line.pop(label_index)

        o.write(construct_line(label, line, labels_dict))
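
The string-label handling in the script amounts to giving each distinct label an integer id in order of first appearance. A minimal standalone sketch of that idea follows; the function name `encode_labels` and the sample labels are made up for illustration.

```python
def encode_labels(labels):
    """Map each distinct label to a string id in order of first appearance."""
    ids = {}
    encoded = []
    for label in labels:
        if label not in ids:
            ids[label] = str(len(ids))
        encoded.append(ids[label])
    return encoded, ids


encoded, ids = encode_labels(["setosa", "virginica", "setosa", "versicolor"])
print(encoded)  # ['0', '1', '0', '2']
print(ids)      # {'setosa': '0', 'virginica': '1', 'versicolor': '2'}
```

Because the ids depend on row order and the script does not save the mapping, converting a training file and a test file separately may assign different ids to the same label.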
Memin