I have been trying to tune my SVM using GridSearchCV, but it is throwing errors.

My code is:

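import pandas as pd
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search before scikit-learn 0.18
from sklearn.svm import SVC

train = pd.read_csv('train_set.csv')
label = pd.read_csv('lebel.csv')

params = {'C': [0.01, 0.1, 1, 10]}
clf = GridSearchCV(SVC(), params, n_jobs=-1)
clf.fit(train, label)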

It throws the error: 'too many indices for array'

But when I simply do this:

clf = SVC()
clf.fit(train.data, label.data)

the code works fine.

Anurag Pandey
  • How is this a Pandas question? Seems more like SciPy to me... Also, consider including a complete, verifiable and reproducible example, which means giving a small sample of your data to test answers on... – Kartik Aug 09 '16 at 09:47
  • @Kartik I have edited it. – Anurag Pandey Aug 10 '16 at 16:32

1 Answer

I suspect the problem lies with your data structure train.data / label.data. I have tested both versions of your code and they work:

import numpy as np
import sklearn.svm as sksvm
import sklearn.grid_search as skgs  # scikit-learn >= 0.18: import sklearn.model_selection as skgs

params = { 'C' : [ 0.01 , 0.1 , 1 , 10]}
X = np.random.rand(1000, 10)  # (1000 x 10) matrix, 1000 points with 10 features
Y = np.random.randint(0, 2, 1000)  # 1000 array, binary labels

mod = sksvm.SVC()
mod.fit(X, Y)

Output:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

and

import numpy as np
import sklearn.svm as sksvm
import sklearn.grid_search as skgs  # scikit-learn >= 0.18: import sklearn.model_selection as skgs

params = { 'C' : [ 0.01 , 0.1 , 1 , 10]}
X = np.random.rand(1000, 10)  # (1000 x 10) matrix, 1000 points with 10 features
Y = np.random.randint(0, 2, 1000)  # 1000 array, binary labels

mod = skgs.GridSearchCV(sksvm.SVC(), params, n_jobs=-1)
mod.fit(X, Y)

Output:

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=-1,
       param_grid={'C': [0.01, 0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, score_func=None, scoring=None, verbose=0)

If your data is in a DataFrame and a Series, the code still works. You can try it by adding:

X = pd.DataFrame(X)
Y = pd.Series(Y)

after you generate X and Y.
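For reference, here is a minimal self-contained sketch of that check. It assumes a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection` rather than `sklearn.grid_search`; the seed and the smaller sample size are arbitrary choices made only to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)  # arbitrary seed, only for repeatability

# Same kind of random data as above, wrapped in pandas objects
X = pd.DataFrame(rng.rand(200, 10))    # 2D: 200 points with 10 features
Y = pd.Series(rng.randint(0, 2, 200))  # 1D: 200 binary labels

params = {'C': [0.01, 0.1, 1, 10]}
mod = GridSearchCV(SVC(), params)
mod.fit(X, Y)

print(mod.best_params_)  # one of the C values from params
```

The point is only that a 2D DataFrame of features together with a 1D Series of labels is accepted by the grid search.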

It is difficult to say more without a reproducible piece of code, though. Also, you should probably add the sklearn tag to the question.

Borja
  • I think the problem is passing the label as a DataFrame, so what should I do? – Anurag Pandey Aug 10 '16 at 16:42
  • I have tried as_type.array(), but nothing works. @Borja I am losing my sanity over this. Please look into it. – Anurag Pandey Aug 10 '16 at 16:51
  • You should extract the Series with the labels from the dataframe. Either `label[column_name]` if you know the name of the column, or `label.iloc[:, 0]` if it's the first column or if there's only one column. – Borja Aug 10 '16 at 16:53
  • That is: `clf.fit(train , label.iloc[:,0])` – Borja Aug 10 '16 at 16:55
  • By the way, what was I doing wrong here? (Just a last question.) – Anurag Pandey Aug 10 '16 at 17:05
  • DataFrames and Series are different objects which use numpy arrays under the hood. A dataframe is a collection of series and the data is stored as a 2D numpy array, whereas a Series is stored as a 1D numpy array. Normally sklearn expects a 1D array as labels, so I'm guessing when you were feeding it a DataFrame (hence a 2D array) it was somehow getting confused. However I'm not sure why it was working with the SVC() and not the GridSearchCV(), or why with my randomly generated data it was working on both. – Borja Aug 13 '16 at 11:16
  • You're welcome :) If this answered your question, then please mark the answer as accepted. – Borja Aug 14 '16 at 12:12
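The fix suggested in the comments above can be sketched as follows. This is an illustration on made-up data, not the asker's files: the column name `'label'` and the random values are assumptions standing in for what `pd.read_csv` would return for a one-column label file, and the import assumes a recent scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)  # arbitrary seed

train = pd.DataFrame(rng.rand(200, 10))
# Hypothetical stand-in for the label file: a DataFrame with a single column
label = pd.DataFrame({'label': rng.randint(0, 2, 200)})

print(label.shape)  # (200, 1): a one-column DataFrame is still 2D

# Pull out the single column as a 1D Series, as suggested in the comments
y = label.iloc[:, 0]
print(y.shape)      # (200,)

# An equivalent 1D NumPy array, if a plain array is preferred
y_array = label.values.ravel()

clf = GridSearchCV(SVC(), {'C': [0.01, 0.1, 1, 10]})
clf.fit(train, y)
```

Checking `label.shape` before calling `fit` is a quick way to see whether the labels are 1D or a one-column 2D table.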