
I am trying to use the sklearn.datasets.dump_svmlight_file function to convert a dataset into an svmlight file. The problem is that some of the columns contain strings.

At the moment, my X array looks similar to this:

[['C' '2'
  'String1']
 ['String2' '1959-6' 'String3' 'SCnc'
  'Bld']
 ...
       ]

When I run the code, this error appears:

ValueError: could not convert string to float: 'C'

Actual code:

from sklearn.datasets import dump_svmlight_file

qid = df['qid'].to_numpy()
y = df['relevance'].to_numpy()
X = df[df.columns.difference(['qid', 'relevance'])].to_numpy()
dump_svmlight_file(X, y, 'dataset.dat', query_id=qid)

So my question is: how can I encode the strings in order to fix the error?

rayqz
  • Enumerate the strings in each column and replace them with numbers. You can use [sklearn OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or [pandas.get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) – furas Mar 06 '22 at 22:31
  • @furas My libsvm file has value 1 in all the rows but only for the corresponding feature_ids, is that correct? – rayqz Mar 06 '22 at 23:38
  • It seems OK. But you could show (in question) small example data before and after encoding. – furas Mar 07 '22 at 00:04
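
The encoding suggested in the comments can be sketched roughly as follows. This is a minimal illustration, not a tested solution for the actual dataset: the toy DataFrame and the column names `code`, `unit`, and `length` are made up, and it assumes every non-numeric feature column should be treated as a categorical string. It writes to an in-memory buffer instead of 'dataset.dat' so the result can be inspected directly.

```python
import io

import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.datasets import dump_svmlight_file
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only 'qid' and 'relevance' come from the question.
df = pd.DataFrame({
    "qid": [1, 1, 2],
    "relevance": [0, 1, 2],
    "code": ["C", "B", "C"],
    "unit": ["SCnc", "Bld", "SCnc"],
    "length": [2.0, 5.0, 3.0],
})

qid = df["qid"].to_numpy()
y = df["relevance"].to_numpy()
features = df[df.columns.difference(["qid", "relevance"])]

# Split the feature columns into numeric ones and everything else (strings).
num_cols = [c for c in features.columns
            if pd.api.types.is_numeric_dtype(features[c])]
cat_cols = [c for c in features.columns if c not in num_cols]

# One-hot encode the string columns; the result is a sparse 0/1 matrix
# with one column per distinct value of each string column.
encoder = OneHotEncoder(handle_unknown="ignore")
X_cat = encoder.fit_transform(features[cat_cols])

# Keep the numeric columns as they are and place them before the encoded ones.
X_num = csr_matrix(features[num_cols].to_numpy(dtype=float))
X = hstack([X_num, X_cat]).tocsr()

# Write in svmlight format; a file path such as 'dataset.dat' works as well.
buf = io.BytesIO()
dump_svmlight_file(X, y, buf, query_id=qid)
print(buf.getvalue().decode())
```

With one-hot encoding, each row has the value 1 only at the feature ids that match its own categories. The svmlight format omits zero entries, so a file in which the encoded features are all 1 is what this approach is expected to produce.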

0 Answers