Python sklearn.datasets.dump_svmlight_file failed to output the right index of column

Question

I want to run SVM light and SVM rank, so I need to convert my data into the SVM light format.

However, I ran into a problem.

My Python code is below:

import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file

self.df = pd.DataFrame()
self.df['patent_id'] = patent_id_list
self.df['Target'] = class_list
self.df['backward_citation'] = backward_citation_list
self.df['uspc_originality'] = uspc_originality_list
self.df['science_linkage'] = science_linkage_list
self.df['sim_bc_structure'] = sim_bc_structure_list
self.df['claim_num'] = claim_num_list
self.qid = dataset_list

X = self.df[np.setdiff1d(self.df.columns, ['patent_id','Target'])]
y = self.df.Target

dump_svmlight_file(X, y, 'test.dat', zero_based=False, query_id=self.qid, multilabel=False)

The output file "test.dat" looks like this:

But the real data looks like this:

The column indices in the output are wrong.

Take the first instance as an example: the value of column 1 is 7, the values of columns 2 to 4 are zero, and the value of column 5 is 2.

So my expected result looks like this:

1 qid:1 1:7 5:2

But the column indices in the output file are completely wrong, and I cannot figure out where the problem occurs.

I have been stuck on this for a long time.

Thank you for your help!

score 2 · Answer 1 · answered Apr 01 '16 at 12:28

I changed the data structure: I used np.array to produce array-like input, and it finally worked.
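Here is a minimal sketch of that approach, not the asker's actual code. The toy values, the `feature_cols` list, and the in-memory buffer are assumptions made for illustration. It also shows one likely cause of shifted indices in the question's code: `np.setdiff1d` returns its result sorted, so the feature columns end up in alphabetical order rather than the order they were added.

```python
import io

import numpy as np
import pandas as pd
from sklearn.datasets import dump_svmlight_file

# Hypothetical toy data standing in for the question's lists.
df = pd.DataFrame({
    "patent_id": ["p1", "p2"],
    "Target": [1, 2],
    "backward_citation": [7, 0],
    "uspc_originality": [0, 3],
    "science_linkage": [0, 0],
    "sim_bc_structure": [0, 1],
    "claim_num": [2, 4],
})
qid = np.array([1, 1])  # one query id per row

# np.setdiff1d sorts its output, so this is alphabetical, not insertion order.
sorted_cols = np.setdiff1d(df.columns, ["patent_id", "Target"])
print(list(sorted_cols))

# Spell out the intended column order, then convert to plain NumPy arrays.
feature_cols = [
    "backward_citation",
    "uspc_originality",
    "science_linkage",
    "sim_bc_structure",
    "claim_num",
]
X = df[feature_cols].to_numpy(dtype=float)
y = df["Target"].to_numpy()

# Write to an in-memory binary buffer here; a file path such as 'test.dat' also works.
buf = io.BytesIO()
dump_svmlight_file(X, y, buf, zero_based=False, query_id=qid)
text = buf.getvalue().decode("ascii")
print(text)
```

With the explicit column list, the first row has 7 in column 1 and 2 in column 5, which matches the line the question expects. With the sorted list, claim_num would move to column 2 instead.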

陳冠穎

score 0 · Answer 2 · answered Nov 30 '16 at 04:55

If you're interested in loading into a numpy array, try:

X = clicks_train[:,0:2]
y = clicks_train[:,2]

where 2 is the index of the target column
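As a small self-contained illustration of that slicing, assuming `clicks_train` is a 2-D NumPy array with two feature columns followed by the target column (the values below are made up):

```python
import numpy as np

# Hypothetical array: columns 0 and 1 are features, column 2 is the target.
clicks_train = np.array([
    [7.0, 0.0, 1.0],
    [0.0, 3.0, 2.0],
])

X = clicks_train[:, 0:2]  # every row, columns 0 and 1
y = clicks_train[:, 2]    # every row, column 2 only

print(X.shape, y.shape)
```

X and y can then be passed to dump_svmlight_file in the same way as in the question.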

babalu
